IS 6813-001 Spring 2023 MSBA Capstone Completion
Project Title : Price Elasticity
Submitted by : Atul Sehgal
Swire Coca-Cola would like to understand how sensitive its products are to price changes. In other words, the company would like to know how elastic demand is with respect to changes in the prices of its products. Price elasticity can vary across product types, geographies, and channels (Supermarket, Convenience Retail, Club Channel, etc.).
Although many factors other than price (e.g. promotions, events, holidays) can affect the demand for products, the objective here is to understand solely the impact of price changes on demand.
Knowing the price elasticity can help Swire decide optimal price points of products to:
1) maximize sales volume
2) maximize gross margin
3) balance the two
The optimal price point can vary across product types, geographies, and channels.
The success of the project can be judged in terms of the tangible/quantifiable dollar benefit potential it can provide.
The benefit can be towards improvement of top-line or bottom-line or both.
Looking at the problem statement and the data provided, the general character of the analytics approach here is supervised machine learning.
The machine will learn from variation in sales volume (dependent variable) with respect to variation in price (independent variable).
The mathematical objective is to understand the lift/dip in sales volume with respect to unit variation in price points.
The mathematical models need to be specific to particular markets and particular products. Hence, the problem needs to be answered via several supervised machine learning models, each built on its own micro analysis universe.
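To make the objective concrete: a constant-elasticity demand curve is often written as Q = A·Pᵉ, so a linear regression of log(volume) on log(price) recovers the elasticity e as the slope. Here is a minimal sketch on simulated data (all numbers below are hypothetical, not from the Swire dataset):

```python
import numpy as np

# Hypothetical constant-elasticity demand curve: Q = A * P**elasticity.
# All parameters below are made up purely for illustration.
rng = np.random.default_rng(42)
price = rng.uniform(1.0, 10.0, size=200)     # simulated unit prices
true_elasticity = -1.5                        # assumed elasticity
volume = 500.0 * price ** true_elasticity     # noiseless demand

# Regressing log(volume) on log(price) recovers the elasticity as the slope.
slope, intercept = np.polyfit(np.log(price), np.log(volume), deg=1)
print(round(slope, 3))  # prints -1.5, the assumed elasticity
```

In practice each market-product universe would get its own fit on observed price and volume, and noise would make the recovered slope approximate rather than exact.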
In Scope:
Out of Scope:
What might be added later:
Who is going to execute the project? : Atul Sehgal
When will the project be finished? : 25th April 2023
Project Milestones:
Before we begin any kind of analysis, we need to make sure that we have the right set of python libraries installed and imported.
pip install pyxlsb
Requirement already satisfied: pyxlsb in /Users/atulsehgal/opt/anaconda3/lib/python3.9/site-packages (1.0.10) Note: you may need to restart the kernel to use updated packages.
pip install datetime
Requirement already satisfied: datetime in /Users/atulsehgal/opt/anaconda3/lib/python3.9/site-packages (5.0) Requirement already satisfied: pytz in /Users/atulsehgal/opt/anaconda3/lib/python3.9/site-packages (from datetime) (2022.1) Requirement already satisfied: zope.interface in /Users/atulsehgal/opt/anaconda3/lib/python3.9/site-packages (from datetime) (5.4.0) Requirement already satisfied: setuptools in /Users/atulsehgal/opt/anaconda3/lib/python3.9/site-packages (from zope.interface->datetime) (63.4.1) Note: you may need to restart the kernel to use updated packages.
pip install kneed
Requirement already satisfied: kneed in /Users/atulsehgal/opt/anaconda3/lib/python3.9/site-packages (0.8.2) Requirement already satisfied: scipy>=1.0.0 in /Users/atulsehgal/opt/anaconda3/lib/python3.9/site-packages (from kneed) (1.9.1) Requirement already satisfied: numpy>=1.14.2 in /Users/atulsehgal/opt/anaconda3/lib/python3.9/site-packages (from kneed) (1.21.5) Note: you may need to restart the kernel to use updated packages.
import pandas as pd
import datetime as dt
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
from matplotlib import text
import seaborn as sns
import numpy as np
import warnings
from kneed import KneeLocator
from IPython.display import HTML, display
import random
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import plot_partial_dependence
import pickle
The file with the data has a .xlsb extension. To read this file into a pandas DataFrame, we will have to use pandas' read_excel() method with the pyxlsb engine.
df=pd.read_excel('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Data/Sales_VS_Price-All Markets.xlsb',engine='pyxlsb')
# see top 10 rows
df.head(10)
| MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | WEEK_ENDING_FRIDAY | PRICE | EQUIVALENT_VOLUME | UNITS | STD_PHYSICAL_VOL | NUMBER_OF_STORES | NUMBER_OF_STORES_SELLING | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M00217 | P002496971635 | 44197 | 88.230 | 17.250 | 23.000 | 11.500 | 21.000 | 3.000 |
| 1 | M00513 | P002496971635 | 44197 | 24.280 | 4.500 | 6.000 | 3.000 | 52.000 | 3.000 |
| 2 | M00207 | P011171232793 | 44400 | 4961.504 | 951.229 | 2701.715 | 337.714 | 566.353 | 441.526 |
| 3 | M00517 | P010091774557 | 44218 | 5.910 | 0.750 | 1.000 | 0.500 | 154.000 | 1.000 |
| 4 | M00320 | P144910371587 | 44435 | 73.580 | 13.866 | 221.850 | 18.487 | 1887.004 | 36.975 |
| 5 | M00520 | P004939001795 | 44743 | 1.990 | 0.440 | 1.000 | 0.417 | 1828.006 | 1.000 |
| 6 | M01015 | P014558461251 | 44197 | 6.180 | 0.167 | 2.000 | 0.083 | 2488.006 | 1.000 |
| 7 | M01316 | P010091774557 | 44197 | 0.000 | 0.750 | 1.000 | 0.500 | 149.000 | 1.000 |
| 8 | M01120 | P006198451610 | 44631 | 17803.505 | 1206.475 | 4454.678 | 742.446 | 439.420 | 330.484 |
| 9 | M00017 | P004939001795 | 44701 | 2.490 | 0.440 | 1.000 | 0.417 | 2400.996 | 1.000 |
Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
So, here are the columns in our dataset:
df.columns
Index(['MARKET_ID_BLINDED', 'PRODUCT_ID_BLINDED', 'WEEK_ENDING_FRIDAY',
'PRICE', 'EQUIVALENT_VOLUME', 'UNITS', 'STD_PHYSICAL_VOL',
'NUMBER_OF_STORES', 'NUMBER_OF_STORES_SELLING'],
dtype='object')
Here's my understanding of what each of these columns means:
For this analysis, amongst the three volume indicators of 'EQUIVALENT_VOLUME', 'UNITS', and 'STD_PHYSICAL_VOL', I have decided to go with 'STD_PHYSICAL_VOL'.
Let's look at the number of rows and columns in the data
print("Number of Rows =",df.shape[0])
print("Number of Columns =",df.shape[1])
Number of Rows = 853321 Number of Columns = 9
Display data types of columns in the data
# print column data types
df.dtypes
MARKET_ID_BLINDED object PRODUCT_ID_BLINDED object WEEK_ENDING_FRIDAY int64 PRICE float64 EQUIVALENT_VOLUME float64 UNITS float64 STD_PHYSICAL_VOL float64 NUMBER_OF_STORES float64 NUMBER_OF_STORES_SELLING float64 dtype: object
Looking at the table above, we can see that the date column 'WEEK_ENDING_FRIDAY' was read as an integer. We will have to convert this column into a date.
The data was read from an Excel file, and the integers in the date column are Excel's serial-number representation of dates. Since Excel's date system nominally starts from 1900-01-01 [yyyy-mm-dd], the easiest approach is to construct a TimedeltaIndex from the serial numbers and add it to a scalar datetime origin.
On validating the converted dates against the provided .xlsb file, we find a two-day gap, hence the correct origin to use is 1899-12-30 [yyyy-mm-dd], i.e. 1900-01-01 minus 2 days (Excel counts 1900-01-01 as serial 1 and wrongly treats 1900 as a leap year).
df['WEEK_ENDING_FRIDAY'] = pd.TimedeltaIndex(df['WEEK_ENDING_FRIDAY'], unit='d') + dt.datetime(1899,12,30)
# examine the data
df
| MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | WEEK_ENDING_FRIDAY | PRICE | EQUIVALENT_VOLUME | UNITS | STD_PHYSICAL_VOL | NUMBER_OF_STORES | NUMBER_OF_STORES_SELLING | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M00217 | P002496971635 | 2021-01-01 | 88.230 | 17.250 | 23.000 | 11.500 | 21.000 | 3.000 |
| 1 | M00513 | P002496971635 | 2021-01-01 | 24.280 | 4.500 | 6.000 | 3.000 | 52.000 | 3.000 |
| 2 | M00207 | P011171232793 | 2021-07-23 | 4961.504 | 951.229 | 2701.715 | 337.714 | 566.353 | 441.526 |
| 3 | M00517 | P010091774557 | 2021-01-22 | 5.910 | 0.750 | 1.000 | 0.500 | 154.000 | 1.000 |
| 4 | M00320 | P144910371587 | 2021-08-27 | 73.580 | 13.866 | 221.850 | 18.487 | 1887.004 | 36.975 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 853316 | M00077 | P056214771244 | 2021-06-04 | 444.960 | 37.750 | 453.000 | 18.875 | 44.000 | 40.000 |
| 853317 | M00207 | P005365771586 | 2022-01-21 | 3950.244 | 698.000 | 837.600 | 279.200 | 590.018 | 206.640 |
| 853318 | M00077 | P014842881625 | 2021-01-22 | 770.370 | 31.167 | 374.000 | 15.583 | 44.000 | 36.000 |
| 853319 | M00207 | P018021618729 | 2022-03-18 | 1012.722 | 65.091 | 781.095 | 32.546 | 596.413 | 197.019 |
| 853320 | M00207 | P087414971407 | 2022-01-28 | 11020.342 | 1565.313 | 10733.572 | 894.464 | 590.018 | 500.030 |
853321 rows × 9 columns
# print column data types
df.dtypes
MARKET_ID_BLINDED object PRODUCT_ID_BLINDED object WEEK_ENDING_FRIDAY datetime64[ns] PRICE float64 EQUIVALENT_VOLUME float64 UNITS float64 STD_PHYSICAL_VOL float64 NUMBER_OF_STORES float64 NUMBER_OF_STORES_SELLING float64 dtype: object
The column 'WEEK_ENDING_FRIDAY' now appears as a datetime column
To be doubly sure about the date conversion, let's display the day of the week corresponding to each converted date. We should only get Friday.
The weekday() method returns the day of the week as an integer, where Monday is 0 and Sunday is 6. Since we expect our dates to fall on Fridays, we should be getting 4.
set([dt.datetime.weekday(x) for x in df['WEEK_ENDING_FRIDAY']])
{4}
In the current data, 'PRICE' is derived from volume, i.e. PRICE = VOLUME x UNIT_PRICE.
To establish price elasticity, we need to back out this UNIT_PRICE. [We want to establish how a change in the unit price of a product affects its sales.]
# Create UNIT_PRICE
df['UNIT_PRICE'] = df['PRICE']/df['STD_PHYSICAL_VOL']
Let's see if any of the columns have missing values.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 853321 entries, 0 to 853320 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MARKET_ID_BLINDED 853321 non-null object 1 PRODUCT_ID_BLINDED 853321 non-null object 2 WEEK_ENDING_FRIDAY 853321 non-null datetime64[ns] 3 PRICE 853321 non-null float64 4 EQUIVALENT_VOLUME 853321 non-null float64 5 UNITS 853321 non-null float64 6 STD_PHYSICAL_VOL 853321 non-null float64 7 NUMBER_OF_STORES 853321 non-null float64 8 NUMBER_OF_STORES_SELLING 853321 non-null float64 9 UNIT_PRICE 853321 non-null float64 dtypes: datetime64[ns](1), float64(7), object(2) memory usage: 65.1+ MB
All the columns have 100% complete values
Let's verify the time span of the data
print("Minimum Date:",min(df['WEEK_ENDING_FRIDAY']))
print("Maximum Date:",max(df['WEEK_ENDING_FRIDAY']))
Minimum Date: 2021-01-01 00:00:00 Maximum Date: 2022-10-21 00:00:00
The duration/span of this data is ~ 1 year and 10 months
Let's look at descriptive stats of numerical columns in the data
# display values up to 2 decimal places
pd.options.display.float_format = '{:,.2f}'.format
# descriptive stats for numerical columns
df.describe()
| PRICE | EQUIVALENT_VOLUME | UNITS | STD_PHYSICAL_VOL | NUMBER_OF_STORES | NUMBER_OF_STORES_SELLING | UNIT_PRICE | |
|---|---|---|---|---|---|---|---|
| count | 853,321.00 | 853,321.00 | 853,321.00 | 853,321.00 | 853,321.00 | 853,321.00 | 853,321.00 |
| mean | 5,394.70 | 453.98 | 2,108.52 | 251.68 | 774.16 | 202.19 | 33.08 |
| std | 19,288.22 | 2,350.04 | 7,225.20 | 1,436.98 | 818.07 | 350.93 | 18.14 |
| min | 0.00 | 0.01 | 0.04 | 0.03 | 21.00 | 1.00 | 0.00 |
| 25% | 116.07 | 7.04 | 37.00 | 4.00 | 158.00 | 12.00 | 17.68 |
| 50% | 757.28 | 46.88 | 274.00 | 26.25 | 444.68 | 59.17 | 28.55 |
| 75% | 3,825.98 | 253.96 | 1,466.00 | 141.12 | 1,799.00 | 233.59 | 47.51 |
| max | 1,069,365.18 | 218,124.44 | 310,620.78 | 145,416.30 | 2,489.99 | 2,489.99 | 577.59 |
The data looks pretty clean with no negative or missing values.
The table in the above section gives a good idea of the distribution of numeric values. However, to gain an understanding of skewness and kurtosis, it is best to plot the distributions.
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
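As a quick toy illustration of the two measures (on synthetic data, not the project dataset), pandas reports sample skewness via skew() and excess kurtosis via kurt():

```python
import numpy as np
import pandas as pd

# Synthetic samples: a symmetric normal vs. a right-skewed, heavy-tailed lognormal.
rng = np.random.default_rng(0)
symmetric = pd.Series(rng.normal(size=10_000))
right_skewed = pd.Series(rng.lognormal(size=10_000))

print(symmetric.skew())     # close to 0: symmetric
print(right_skewed.skew())  # clearly positive: long right tail
print(right_skewed.kurt())  # large excess kurtosis: heavy tails
```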
warnings.filterwarnings('ignore')
# keep only numeric columns of interest
df_num=df[['PRICE','STD_PHYSICAL_VOL','UNIT_PRICE','NUMBER_OF_STORES','NUMBER_OF_STORES_SELLING']]
fig, axes = plt.subplots(nrows = 3, ncols = 2) # axes is a 2d array (3x2)
axes = axes.flatten() # convert axes to a 1d array of length 6
fig.set_size_inches(15, 15)
for ax, col in zip(axes, df_num.columns):
sns.distplot(df_num[col], ax = ax, color='red')
ax.set_title(col)
1) PRICE and STD_PHYSICAL_VOL are right-skewed with very high kurtosis.
2) NUMBER_OF_STORES appears to be multimodal [has 4 modes].
3) NUMBER_OF_STORES_SELLING is right-skewed with high kurtosis.
4) UNIT_PRICE is right-skewed with high kurtosis.
For the purpose of this project, outlier values may not necessarily mean bad data.
However, we also do not want the models to be skewed and fail to generalize.
For now, I have decided to keep the outliers. My plan is to build separate models for every product in every market. I think working on a smaller individual analysis universe for each model will automatically take care of the outliers.
At this point, my plan is to harness the higher outlier values in the models instead of getting rid of them; these higher outlier values account for a significant % of sales.
During exploratory data analysis, I get rid of low-performing products in section 4.5.5.1.
# df2 was intended to hold the data after outlier removal; for now the outliers are kept, so df2 simply references df
df2= df
Let's look at descriptive stats of categorical columns in the data
# descriptive stats for categorical columns
df.describe(include='object')
| MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | |
|---|---|---|
| count | 853321 | 853321 |
| unique | 20 | 1827 |
| top | M00207 | P000155337256 |
| freq | 71945 | 1900 |
There are 20 unique markets and 1,827 unique products
The following key questions come to my mind based on the exploration of the data so far:
1) Does a high unit price always lead to a decline in sales volume?
a) Are there any exceptions to this rule?
b) Is this true for all the markets?
c) Is this true for all the products?
2) Do a few high volume markets account for a significant portion of sales volume?
3) Do a few high volume products account for a significant portion of sales volume?
4) Are the number of stores and the number of stores selling correlated?
5) Is sales volume directly proportional to the number of stores?
6) Is sales volume directly proportional to the number of stores selling?
The above questions can be converted into six simple hypotheses. We can test these hypotheses at multiple levels of markets and products.
Hypothesis 1: Higher unit price will result in lower sales volume.
Hypothesis 2: A few high volume markets would account for significant portion of sales.
Hypothesis 3: A few high volume products would account for significant portion of sales.
Hypothesis 4: Higher number of stores would always mean higher number of stores selling.
Hypothesis 5: Higher number of stores would result in higher sales volume.
Hypothesis 6: Higher number of stores selling would result in higher sales volume.
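Some of these hypotheses lend themselves to a quick correlation screen before any plotting. For instance, Hypothesis 4 could be checked with a Pearson correlation between the two store counts; the sketch below uses made-up toy values standing in for the actual NUMBER_OF_STORES and NUMBER_OF_STORES_SELLING columns:

```python
import numpy as np

# Toy stand-ins for NUMBER_OF_STORES and NUMBER_OF_STORES_SELLING (hypothetical values).
number_of_stores = np.array([21, 52, 154, 444, 1799, 2489], dtype=float)
stores_selling = np.array([3, 3, 1, 59, 233, 500], dtype=float)

# Pearson correlation; a value near +1 would support Hypothesis 4.
r = np.corrcoef(number_of_stores, stores_selling)[0, 1]
print(round(r, 2))
```

On the real data the same one-liner would be applied to the two dataframe columns.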
Let's examine Hypothesis 1 on the overall data
Here's the scatter plot of Unit Price vs. Standard Physical Volume
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.scatterplot(x=df2["UNIT_PRICE"], y=df2["STD_PHYSICAL_VOL"], alpha=0.1, color='red')
p.set_xlabel("UNIT_PRICE", fontsize = 20)
p.set_ylabel("STD_PHYSICAL_VOL", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.suptitle('Impact of UNIT PRICE on Volume', fontsize = 26)
plt.title('Hypothesis 1: Higher unit price will result in lower sales volume', fontsize = 20)
Text(0.5, 1.0, 'Hypothesis 1: Higher unit price will result in lower sales volume')
Zooming in ...
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.scatterplot(x=df2["UNIT_PRICE"], y=df2["STD_PHYSICAL_VOL"], alpha=0.1, color='red')
p.set_xlabel("UNIT_PRICE", fontsize = 20)
p.set_ylabel("STD_PHYSICAL_VOL", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.xlim(0,125)
plt.ylim(-2000,15000)
plt.suptitle('Impact of UNIT PRICE on Volume', fontsize = 26)
plt.title('Hypothesis 1: Higher unit price will result in lower sales volume', fontsize = 20)
Text(0.5, 1.0, 'Hypothesis 1: Higher unit price will result in lower sales volume')
We see a downward trend from left to right, but let's confirm this by fitting a linear regression line
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.scatterplot(data=df2, x="UNIT_PRICE", y="STD_PHYSICAL_VOL", alpha=0.1, color='red')
sns.regplot(data=df2, x="UNIT_PRICE", y="STD_PHYSICAL_VOL",fit_reg=True, scatter=False, color='black',line_kws={'linewidth':5})
p.set_xlabel("UNIT_PRICE", fontsize = 20)
p.set_ylabel("STD_PHYSICAL_VOL", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.xlim(0,125)
plt.ylim(-2000,15000)
plt.suptitle('Impact of UNIT PRICE on Volume', fontsize = 26)
plt.title('Hypothesis 1: Higher unit price will result in lower sales volume', fontsize = 20)
Text(0.5, 1.0, 'Hypothesis 1: Higher unit price will result in lower sales volume')
We notice a general trend: an increase in price is related to a decline in sales.
Hence, Hypothesis 1 [Higher unit price will result in lower sales volume] appears to be true on the overall data
Let's see if we can find a handful of markets that account for a significant portion of sales volume.
To do this, we will create a diminishing returns curve.
This diminishing returns curve has markets sorted in descending order of sales volume on the x-axis and the corresponding cumulative sales volume on the y-axis.
Let's first aggregate the data by markets and sort the markets in descending order of sales volume.
# calculate market-wise summary for vol and price
pd.options.display.float_format = '{:,.2f}'.format
df2_market_summary=df2.groupby(['MARKET_ID_BLINDED'],as_index=False).agg(STD_PHYSICAL_VOL=pd.NamedAgg(column='STD_PHYSICAL_VOL',aggfunc=lambda x: x.sum()),
UNIT_PRICE=pd.NamedAgg(column='UNIT_PRICE',aggfunc=lambda x: x.mean()))
# sort in descending order of volume
df2_market_summary=df2_market_summary.sort_values(by=['STD_PHYSICAL_VOL'],ascending=False)
# reset the index so that it can be used as x-axis of diminishing returns curve
df2_market_summary.reset_index(inplace=True, drop=True)
df2_market_summary.reset_index(inplace=True, drop=False)
df2_market_summary
| index | MARKET_ID_BLINDED | STD_PHYSICAL_VOL | UNIT_PRICE | |
|---|---|---|---|---|
| 0 | 0 | M00207 | 51,129,493.92 | 27.86 |
| 1 | 1 | M00418 | 26,126,464.40 | 28.92 |
| 2 | 2 | M00211 | 24,928,695.28 | 28.06 |
| 3 | 3 | M01120 | 23,574,334.55 | 28.33 |
| 4 | 4 | M00017 | 19,500,063.92 | 39.73 |
| 5 | 5 | M00182 | 18,102,612.98 | 28.43 |
| 6 | 6 | M00110 | 11,749,374.43 | 40.34 |
| 7 | 7 | M01015 | 10,426,679.97 | 40.59 |
| 8 | 8 | M00320 | 9,620,011.64 | 41.40 |
| 9 | 9 | M00520 | 7,692,720.10 | 41.04 |
| 10 | 10 | M00085 | 2,526,180.04 | 36.57 |
| 11 | 11 | M00513 | 2,446,214.79 | 27.70 |
| 12 | 12 | M00077 | 1,845,603.28 | 28.75 |
| 13 | 13 | M01119 | 1,007,905.36 | 29.25 |
| 14 | 14 | M00217 | 921,219.21 | 27.62 |
| 15 | 15 | M00045 | 738,172.26 | 38.34 |
| 16 | 16 | M01316 | 735,447.77 | 36.58 |
| 17 | 17 | M00208 | 705,560.16 | 37.76 |
| 18 | 18 | M01519 | 579,719.27 | 28.61 |
| 19 | 19 | M00517 | 403,285.91 | 38.00 |
Now, let's cumulate the volumes..
# calculate cumulative volume for the diminishing returns curve
df2_market_summary['CUMULATIVE_STD_PHYSICAL_VOL'] = df2_market_summary['STD_PHYSICAL_VOL'].cumsum()
# calculate vol percentages
df2_market_summary['STD_PHYSICAL_VOL_PERCENTAGE']= \
round(df2_market_summary['STD_PHYSICAL_VOL']/
max(df2_market_summary['CUMULATIVE_STD_PHYSICAL_VOL'])*100,1)
# calculate cumulative percentages
df2_market_summary['CUMULATIVE_STD_PHYSICAL_VOL_PERCENTAGE']= \
round(df2_market_summary['CUMULATIVE_STD_PHYSICAL_VOL']/
max(df2_market_summary['CUMULATIVE_STD_PHYSICAL_VOL'])*100,1)
# convert volume to Millions
df2_market_summary['CUMULATIVE_STD_PHYSICAL_VOL_MN']=round(df2_market_summary['CUMULATIVE_STD_PHYSICAL_VOL']/1000000,2)
df2_market_summary
| index | MARKET_ID_BLINDED | STD_PHYSICAL_VOL | UNIT_PRICE | CUMULATIVE_STD_PHYSICAL_VOL | STD_PHYSICAL_VOL_PERCENTAGE | CUMULATIVE_STD_PHYSICAL_VOL_PERCENTAGE | CUMULATIVE_STD_PHYSICAL_VOL_MN | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | M00207 | 51,129,493.92 | 27.86 | 51,129,493.92 | 23.80 | 23.80 | 51.13 |
| 1 | 1 | M00418 | 26,126,464.40 | 28.92 | 77,255,958.32 | 12.20 | 36.00 | 77.26 |
| 2 | 2 | M00211 | 24,928,695.28 | 28.06 | 102,184,653.60 | 11.60 | 47.60 | 102.18 |
| 3 | 3 | M01120 | 23,574,334.55 | 28.33 | 125,758,988.14 | 11.00 | 58.60 | 125.76 |
| 4 | 4 | M00017 | 19,500,063.92 | 39.73 | 145,259,052.06 | 9.10 | 67.60 | 145.26 |
| 5 | 5 | M00182 | 18,102,612.98 | 28.43 | 163,361,665.04 | 8.40 | 76.10 | 163.36 |
| 6 | 6 | M00110 | 11,749,374.43 | 40.34 | 175,111,039.47 | 5.50 | 81.50 | 175.11 |
| 7 | 7 | M01015 | 10,426,679.97 | 40.59 | 185,537,719.44 | 4.90 | 86.40 | 185.54 |
| 8 | 8 | M00320 | 9,620,011.64 | 41.40 | 195,157,731.08 | 4.50 | 90.90 | 195.16 |
| 9 | 9 | M00520 | 7,692,720.10 | 41.04 | 202,850,451.18 | 3.60 | 94.50 | 202.85 |
| 10 | 10 | M00085 | 2,526,180.04 | 36.57 | 205,376,631.22 | 1.20 | 95.60 | 205.38 |
| 11 | 11 | M00513 | 2,446,214.79 | 27.70 | 207,822,846.02 | 1.10 | 96.80 | 207.82 |
| 12 | 12 | M00077 | 1,845,603.28 | 28.75 | 209,668,449.29 | 0.90 | 97.60 | 209.67 |
| 13 | 13 | M01119 | 1,007,905.36 | 29.25 | 210,676,354.66 | 0.50 | 98.10 | 210.68 |
| 14 | 14 | M00217 | 921,219.21 | 27.62 | 211,597,573.87 | 0.40 | 98.50 | 211.60 |
| 15 | 15 | M00045 | 738,172.26 | 38.34 | 212,335,746.12 | 0.30 | 98.90 | 212.34 |
| 16 | 16 | M01316 | 735,447.77 | 36.58 | 213,071,193.89 | 0.30 | 99.20 | 213.07 |
| 17 | 17 | M00208 | 705,560.16 | 37.76 | 213,776,754.05 | 0.30 | 99.50 | 213.78 |
| 18 | 18 | M01519 | 579,719.27 | 28.61 | 214,356,473.32 | 0.30 | 99.80 | 214.36 |
| 19 | 19 | M00517 | 403,285.91 | 38.00 | 214,759,759.23 | 0.20 | 100.00 | 214.76 |
Now, let's plot the data with the market index on the x-axis and cumulative sales volume on the y-axis
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.lineplot(x=df2_market_summary['index'],y=df2_market_summary['CUMULATIVE_STD_PHYSICAL_VOL_MN'],color="red")
p.set_xlabel("Markets", fontsize = 20)
p.set_ylabel("Cumulative Volume (Mn)", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.suptitle('Cumulative Distribution of Volume by Markets', fontsize = 26)
plt.title('Volumes for Markets are arranged in descending order for this diminishing returns plot', fontsize = 14)
Text(0.5, 1.0, 'Volumes for Markets are arranged in descending order for this diminishing returns plot')
We should find the elbow or knee of this curve to identify the point where sales volume gains start to diminish drastically.
kn = KneeLocator(df2_market_summary['index'], df2_market_summary['CUMULATIVE_STD_PHYSICAL_VOL'], curve='concave', direction='increasing')
print(kn.knee)
8
The above output tells us that after the 9th market, we experience significantly diminishing returns in volume.
[It's the 9th market and not the 8th because index 0 corresponds to the first market]
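For intuition about what KneeLocator is doing: after normalizing both axes to [0, 1], the knee is roughly the point where the curve rises farthest above the diagonal (the core idea of the Kneedle algorithm). Here is a minimal numpy sketch on a synthetic concave curve, not the market data:

```python
import numpy as np

# Synthetic concave, increasing cumulative curve: steep early growth, flat tail.
x = np.arange(20, dtype=float)
y = np.concatenate([np.linspace(10, 90, 9),      # fast accumulation (indices 0-8)
                    np.linspace(91, 100, 11)])   # slow accumulation (indices 9-19)

# Normalize both axes to [0, 1]; the knee is where the normalized curve
# rises farthest above the diagonal.
xn = (x - x.min()) / (x.max() - x.min())
yn = (y - y.min()) / (y.max() - y.min())
knee = int(np.argmax(yn - xn))
print(knee)  # prints 8: the bend from steep to flat growth
```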
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.lineplot(x=df2_market_summary['index'],y=df2_market_summary['CUMULATIVE_STD_PHYSICAL_VOL_MN'],color="red")
p.set_xlabel("Markets", fontsize = 20)
p.set_ylabel("Cumulative Volume (Mn)", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.suptitle('Cumulative Distribution of Volume by Markets', fontsize = 26)
plt.title('Volumes for Markets are arranged in descending order for this diminishing returns plot', fontsize = 14)
plt.axvline(x=kn.knee,
color="black",
linestyle="dashed")
<matplotlib.lines.Line2D at 0x7f8be4702ac0>
Let's look at the values on the left-side and right-side of the knee of the curve.
df2_market_summary.loc[df2_market_summary['index']<=(kn.knee),'MARKET_GROUP']='Left'
df2_market_summary.loc[df2_market_summary['index'] > (kn.knee),'MARKET_GROUP']='Right'
df2_market_summary
| index | MARKET_ID_BLINDED | STD_PHYSICAL_VOL | UNIT_PRICE | CUMULATIVE_STD_PHYSICAL_VOL | STD_PHYSICAL_VOL_PERCENTAGE | CUMULATIVE_STD_PHYSICAL_VOL_PERCENTAGE | CUMULATIVE_STD_PHYSICAL_VOL_MN | MARKET_GROUP | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | M00207 | 51,129,493.92 | 27.86 | 51,129,493.92 | 23.80 | 23.80 | 51.13 | Left |
| 1 | 1 | M00418 | 26,126,464.40 | 28.92 | 77,255,958.32 | 12.20 | 36.00 | 77.26 | Left |
| 2 | 2 | M00211 | 24,928,695.28 | 28.06 | 102,184,653.60 | 11.60 | 47.60 | 102.18 | Left |
| 3 | 3 | M01120 | 23,574,334.55 | 28.33 | 125,758,988.14 | 11.00 | 58.60 | 125.76 | Left |
| 4 | 4 | M00017 | 19,500,063.92 | 39.73 | 145,259,052.06 | 9.10 | 67.60 | 145.26 | Left |
| 5 | 5 | M00182 | 18,102,612.98 | 28.43 | 163,361,665.04 | 8.40 | 76.10 | 163.36 | Left |
| 6 | 6 | M00110 | 11,749,374.43 | 40.34 | 175,111,039.47 | 5.50 | 81.50 | 175.11 | Left |
| 7 | 7 | M01015 | 10,426,679.97 | 40.59 | 185,537,719.44 | 4.90 | 86.40 | 185.54 | Left |
| 8 | 8 | M00320 | 9,620,011.64 | 41.40 | 195,157,731.08 | 4.50 | 90.90 | 195.16 | Left |
| 9 | 9 | M00520 | 7,692,720.10 | 41.04 | 202,850,451.18 | 3.60 | 94.50 | 202.85 | Right |
| 10 | 10 | M00085 | 2,526,180.04 | 36.57 | 205,376,631.22 | 1.20 | 95.60 | 205.38 | Right |
| 11 | 11 | M00513 | 2,446,214.79 | 27.70 | 207,822,846.02 | 1.10 | 96.80 | 207.82 | Right |
| 12 | 12 | M00077 | 1,845,603.28 | 28.75 | 209,668,449.29 | 0.90 | 97.60 | 209.67 | Right |
| 13 | 13 | M01119 | 1,007,905.36 | 29.25 | 210,676,354.66 | 0.50 | 98.10 | 210.68 | Right |
| 14 | 14 | M00217 | 921,219.21 | 27.62 | 211,597,573.87 | 0.40 | 98.50 | 211.60 | Right |
| 15 | 15 | M00045 | 738,172.26 | 38.34 | 212,335,746.12 | 0.30 | 98.90 | 212.34 | Right |
| 16 | 16 | M01316 | 735,447.77 | 36.58 | 213,071,193.89 | 0.30 | 99.20 | 213.07 | Right |
| 17 | 17 | M00208 | 705,560.16 | 37.76 | 213,776,754.05 | 0.30 | 99.50 | 213.78 | Right |
| 18 | 18 | M01519 | 579,719.27 | 28.61 | 214,356,473.32 | 0.30 | 99.80 | 214.36 | Right |
| 19 | 19 | M00517 | 403,285.91 | 38.00 | 214,759,759.23 | 0.20 | 100.00 | 214.76 | Right |
df2_market_summary.groupby(['MARKET_GROUP'],as_index=False).agg(COUNT_OF_MARKETS=pd.NamedAgg(column='MARKET_ID_BLINDED',aggfunc=lambda x: x.count()),
VOL_PERCENTAGE=pd.NamedAgg(column='STD_PHYSICAL_VOL_PERCENTAGE',aggfunc=lambda x: x.sum()))
| MARKET_GROUP | COUNT_OF_MARKETS | VOL_PERCENTAGE | |
|---|---|---|---|
| 0 | Left | 9 | 91.00 |
| 1 | Right | 11 | 9.10 |
9 out of 20 markets account for 91% of the volume
Hence, Hypothesis 2 [A few high volume markets would account for significant portion of sales] appears to be true.
Let's go back to hypothesis 1 and see if it is true for all the markets..
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.lmplot(data=df2, x="UNIT_PRICE", y="STD_PHYSICAL_VOL",scatter=True,
hue='MARKET_ID_BLINDED', height=10, aspect=2.0, scatter_kws={'alpha':0.05},
line_kws={'linewidth':3}, legend=False, ci=None)
plt.xlabel("UNIT_PRICE", fontsize = 20)
plt.ylabel("STD_PHYSICAL_VOL", fontsize = 20)
plt.xticks(fontsize= 24)
plt.yticks(fontsize= 24)
plt.xlim(0,125)
plt.ylim(-2000,15000)
plt.suptitle('Impact of UNIT PRICE on Volume by Market', fontsize = 30)
plt.title('Hypothesis 1: Higher unit price will result in lower sales volume', fontsize = 20)
plt.legend(fontsize=20,loc='right')
<matplotlib.legend.Legend at 0x7f8be50d0430>
<Figure size 2000x1000 with 0 Axes>
To analyze the graph neatly, we should get rid of the dots in the background and just focus on the lines..
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
#p=sns.scatterplot(data=df2, x="UNIT_PRICE", y="STD_PHYSICAL_VOL", alpha=0.1, color='red')
p=sns.lmplot(data=df2, x="UNIT_PRICE", y="STD_PHYSICAL_VOL",scatter=False,
hue='MARKET_ID_BLINDED', height=10, aspect=2.0, scatter_kws={'alpha':0.05},
line_kws={'linewidth':3}, legend=False,ci=None)
plt.xlabel("UNIT_PRICE", fontsize = 20)
plt.ylabel("STD_PHYSICAL_VOL", fontsize = 20)
plt.xticks(fontsize= 24)
plt.yticks(fontsize= 24)
plt.xlim(0,125)
plt.ylim(-2000,15000)
plt.suptitle('Impact of UNIT PRICE on Volume by Market', fontsize = 30)
plt.title('Hypothesis 1: Higher unit price will result in lower sales volume', fontsize = 20)
plt.legend(fontsize=20,loc='right')
<matplotlib.legend.Legend at 0x7f8be64dcb50>
<Figure size 2000x1000 with 0 Axes>
Zooming further in..
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
#p=sns.scatterplot(data=df2, x="UNIT_PRICE", y="STD_PHYSICAL_VOL", alpha=0.1, color='red')
p=sns.lmplot(data=df2, x="UNIT_PRICE", y="STD_PHYSICAL_VOL",scatter=False,
hue='MARKET_ID_BLINDED', height=10, aspect=2.0, scatter_kws={'alpha':0.05},
line_kws={'linewidth':3}, legend=False, ci=None)
plt.xlabel("UNIT_PRICE", fontsize = 20)
plt.ylabel("STD_PHYSICAL_VOL", fontsize = 20)
plt.xticks(fontsize= 24)
plt.yticks(fontsize= 24)
plt.xlim(0,125)
plt.ylim(-2000,2000)
plt.suptitle('Impact of UNIT PRICE on Volume by Market', fontsize = 30)
plt.title('Hypothesis 1: Higher unit price will result in lower sales volume', fontsize = 20)
plt.legend(fontsize=20,loc='right')
<matplotlib.legend.Legend at 0x7f8be50a1400>
<Figure size 2000x1000 with 0 Axes>
We can see that for all 20 markets, an increase in price is related to a decline in sales.
Hence, Hypothesis 1 [Higher unit price will result in lower sales volume] appears to be true for all the markets.
From the slopes of the lines, we can say that certain markets are more price elastic than others.
Let's create a diminishing returns plot for products [similar to markets in previous section].
# calculate product-wise summary for vol and price
pd.options.display.float_format = '{:,.2f}'.format
df2_product_summary=df2.groupby(['PRODUCT_ID_BLINDED'],as_index=False).agg(STD_PHYSICAL_VOL=pd.NamedAgg(column='STD_PHYSICAL_VOL',aggfunc=lambda x: x.sum()),
UNIT_PRICE=pd.NamedAgg(column='UNIT_PRICE',aggfunc=lambda x: x.mean()))
# sort in descending order of volume
df2_product_summary=df2_product_summary.sort_values(by=['STD_PHYSICAL_VOL'],ascending=False)
# reset the index so that it can be used as x-axis of diminishing returns curve
df2_product_summary.reset_index(inplace=True, drop=True)
df2_product_summary.reset_index(inplace=True, drop=False)
df2_product_summary
| index | PRODUCT_ID_BLINDED | STD_PHYSICAL_VOL | UNIT_PRICE | |
|---|---|---|---|---|
| 0 | 0 | P008551640344 | 21,679,585.61 | 11.63 |
| 1 | 1 | P000930861733 | 10,109,905.09 | 11.47 |
| 2 | 2 | P016001323615 | 8,622,522.11 | 5.98 |
| 3 | 3 | P000155337256 | 7,376,317.01 | 11.74 |
| 4 | 4 | P000477504378 | 5,413,932.69 | 11.13 |
| ... | ... | ... | ... | ... |
| 1822 | 1822 | P001773404922 | 0.04 | 59.29 |
| 1823 | 1823 | P001769251499 | 0.04 | 56.90 |
| 1824 | 1824 | P007751606747 | 0.04 | 56.90 |
| 1825 | 1825 | P090610371021 | 0.04 | 95.00 |
| 1826 | 1826 | P000000068633 | 0.04 | 78.33 |
1827 rows × 4 columns
# calculate cumulative volume for the diminishing returns curve
df2_product_summary['CUMULATIVE_STD_PHYSICAL_VOL'] = df2_product_summary['STD_PHYSICAL_VOL'].cumsum()
# calculate vol percentages
df2_product_summary['STD_PHYSICAL_VOL_PERCENTAGE']= \
round(df2_product_summary['STD_PHYSICAL_VOL']/
max(df2_product_summary['CUMULATIVE_STD_PHYSICAL_VOL'])*100,4)
# calculate cumulative percentages
df2_product_summary['CUMULATIVE_STD_PHYSICAL_VOL_PERCENTAGE']= \
round(df2_product_summary['CUMULATIVE_STD_PHYSICAL_VOL']/
max(df2_product_summary['CUMULATIVE_STD_PHYSICAL_VOL'])*100,4)
# convert volume to Millions
df2_product_summary['CUMULATIVE_STD_PHYSICAL_VOL_MN']=round(df2_product_summary['CUMULATIVE_STD_PHYSICAL_VOL']/1000000,2)
df2_product_summary
| index | PRODUCT_ID_BLINDED | STD_PHYSICAL_VOL | UNIT_PRICE | CUMULATIVE_STD_PHYSICAL_VOL | STD_PHYSICAL_VOL_PERCENTAGE | CUMULATIVE_STD_PHYSICAL_VOL_PERCENTAGE | CUMULATIVE_STD_PHYSICAL_VOL_MN | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | P008551640344 | 21,679,585.61 | 11.63 | 21,679,585.61 | 10.09 | 10.09 | 21.68 |
| 1 | 1 | P000930861733 | 10,109,905.09 | 11.47 | 31,789,490.70 | 4.71 | 14.80 | 31.79 |
| 2 | 2 | P016001323615 | 8,622,522.11 | 5.98 | 40,412,012.81 | 4.01 | 18.82 | 40.41 |
| 3 | 3 | P000155337256 | 7,376,317.01 | 11.74 | 47,788,329.81 | 3.43 | 22.25 | 47.79 |
| 4 | 4 | P000477504378 | 5,413,932.69 | 11.13 | 53,202,262.51 | 2.52 | 24.77 | 53.20 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1822 | 1822 | P001773404922 | 0.04 | 59.29 | 214,759,759.06 | 0.00 | 100.00 | 214.76 |
| 1823 | 1823 | P001769251499 | 0.04 | 56.90 | 214,759,759.10 | 0.00 | 100.00 | 214.76 |
| 1824 | 1824 | P007751606747 | 0.04 | 56.90 | 214,759,759.15 | 0.00 | 100.00 | 214.76 |
| 1825 | 1825 | P090610371021 | 0.04 | 95.00 | 214,759,759.19 | 0.00 | 100.00 | 214.76 |
| 1826 | 1826 | P000000068633 | 0.04 | 78.33 | 214,759,759.23 | 0.00 | 100.00 | 214.76 |
1827 rows × 8 columns
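As an aside, the cumulative column built with the explicit loop above can also be computed with a single vectorized pandas call. A minimal sketch on hypothetical volumes (not the actual data):

```python
import pandas as pd

# hypothetical product summary, already sorted by volume in descending order
df_demo = pd.DataFrame({'STD_PHYSICAL_VOL': [50.0, 30.0, 15.0, 5.0]})

# cumsum() produces the same running total as the manual loop
df_demo['CUMULATIVE_STD_PHYSICAL_VOL'] = df_demo['STD_PHYSICAL_VOL'].cumsum()
print(df_demo['CUMULATIVE_STD_PHYSICAL_VOL'].tolist())  # → [50.0, 80.0, 95.0, 100.0]
```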
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.lineplot(x=df2_product_summary['index'],y=df2_product_summary['CUMULATIVE_STD_PHYSICAL_VOL_MN'],color="red")
p.set_xlabel("products", fontsize = 20)
p.set_ylabel("Cumulative Volume (Mn)", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.suptitle('Cumulative Distribution of Volume by Products', fontsize = 26)
plt.title('Volumes for products are arranged in descending order for this diminishing-returns plot', fontsize = 14)
Text(0.5, 1.0, 'Volumes for products are arranged in descending order for this diminishing-returns plot')
Let's find the knee/elbow of this curve.
kn = KneeLocator(df2_product_summary['index'], df2_product_summary['CUMULATIVE_STD_PHYSICAL_VOL'], curve='concave', direction='increasing')
print(kn.knee)
320
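KneeLocator reports index 320 here. For intuition, the knee of a concave, increasing curve can be located as the point farthest above the straight chord joining the curve's endpoints. A minimal numpy sketch of that idea on synthetic data (not the actual volumes, and not necessarily KneeLocator's exact algorithm):

```python
import numpy as np

# synthetic concave, increasing curve resembling a cumulative-volume plot
x = np.arange(200, dtype=float)
y = np.sqrt(x)

# normalize both axes to [0, 1] so the chord becomes the line y_n = x_n
x_n = (x - x[0]) / (x[-1] - x[0])
y_n = (y - y[0]) / (y[-1] - y[0])

# the knee is where the vertical gap above the chord is largest
dist = y_n - x_n
knee = int(np.argmax(dist))
print(knee)  # → 50 for this synthetic curve
```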
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.lineplot(x=df2_product_summary['index'],y=df2_product_summary['CUMULATIVE_STD_PHYSICAL_VOL_MN'],color="red")
p.set_xlabel("products", fontsize = 20)
p.set_ylabel("Cumulative Volume (Mn)", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.suptitle('Cumulative Distribution of Volume by Products', fontsize = 26)
plt.title('Volumes for products are arranged in descending order for this diminishing-returns plot', fontsize = 14)
plt.axvline(x=kn.knee,
color="black",
linestyle="dashed")
<matplotlib.lines.Line2D at 0x7f8be64bdeb0>
df2_product_summary.loc[df2_product_summary['index']<=(kn.knee),'PRODUCT_GROUP']='Left'
df2_product_summary.loc[df2_product_summary['index'] > (kn.knee),'PRODUCT_GROUP']='Right'
df2_product_summary
| index | PRODUCT_ID_BLINDED | STD_PHYSICAL_VOL | UNIT_PRICE | CUMULATIVE_STD_PHYSICAL_VOL | STD_PHYSICAL_VOL_PERCENTAGE | CUMULATIVE_STD_PHYSICAL_VOL_PERCENTAGE | CUMULATIVE_STD_PHYSICAL_VOL_MN | PRODUCT_GROUP | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | P008551640344 | 21,679,585.61 | 11.63 | 21,679,585.61 | 10.09 | 10.09 | 21.68 | Left |
| 1 | 1 | P000930861733 | 10,109,905.09 | 11.47 | 31,789,490.70 | 4.71 | 14.80 | 31.79 | Left |
| 2 | 2 | P016001323615 | 8,622,522.11 | 5.98 | 40,412,012.81 | 4.01 | 18.82 | 40.41 | Left |
| 3 | 3 | P000155337256 | 7,376,317.01 | 11.74 | 47,788,329.81 | 3.43 | 22.25 | 47.79 | Left |
| 4 | 4 | P000477504378 | 5,413,932.69 | 11.13 | 53,202,262.51 | 2.52 | 24.77 | 53.20 | Left |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1822 | 1822 | P001773404922 | 0.04 | 59.29 | 214,759,759.06 | 0.00 | 100.00 | 214.76 | Right |
| 1823 | 1823 | P001769251499 | 0.04 | 56.90 | 214,759,759.10 | 0.00 | 100.00 | 214.76 | Right |
| 1824 | 1824 | P007751606747 | 0.04 | 56.90 | 214,759,759.15 | 0.00 | 100.00 | 214.76 | Right |
| 1825 | 1825 | P090610371021 | 0.04 | 95.00 | 214,759,759.19 | 0.00 | 100.00 | 214.76 | Right |
| 1826 | 1826 | P000000068633 | 0.04 | 78.33 | 214,759,759.23 | 0.00 | 100.00 | 214.76 | Right |
1827 rows × 9 columns
df2_product_summary.groupby(['PRODUCT_GROUP'],as_index=False).agg(COUNT_OF_productS=pd.NamedAgg(column='PRODUCT_ID_BLINDED',aggfunc='count'),
VOL_PERCENTAGE=pd.NamedAgg(column='STD_PHYSICAL_VOL_PERCENTAGE',aggfunc='sum'))
| PRODUCT_GROUP | COUNT_OF_productS | VOL_PERCENTAGE | |
|---|---|---|---|
| 0 | Left | 321 | 91.52 |
| 1 | Right | 1506 | 8.48 |
round(321/(321+1506)*100,1)
17.6
17.6% of products [321 out of 1,827] account for 91.5% of the volume.
Hence, Hypothesis 3 [A few high volume products would account for significant portion of sales] appears to be true.
1,827 products are a lot to analyze and plot. From this point onwards we will restrict the analysis to 321 products that account for 91.5% of the volume.
# restricting the data to 321 top products
products_to_keep=df2_product_summary.loc[df2_product_summary['PRODUCT_GROUP']=='Left','PRODUCT_ID_BLINDED'] \
.drop_duplicates().to_list()
df3=df2[df2['PRODUCT_ID_BLINDED'].isin(products_to_keep)]
print("Check the count of dropped rows")
print("Before =",df2.shape[0])
print("After =",df3.shape[0])
Check the count of dropped rows
Before = 853321
After = 414819
Let's go back to Hypothesis 1 and see if it is true for all the top products.
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.lmplot(data=df3, x="UNIT_PRICE", y="STD_PHYSICAL_VOL",scatter=True,
hue='PRODUCT_ID_BLINDED', height=10, aspect=2.0, scatter_kws={'alpha':0.05},
line_kws={'linewidth':3}, legend=False,ci=None)
plt.xlabel("UNIT_PRICE", fontsize = 20)
plt.ylabel("STD_PHYSICAL_VOL", fontsize = 20)
plt.xticks(fontsize= 24)
plt.yticks(fontsize= 24)
plt.xlim(0,100)
plt.ylim(-2000,15000)
plt.suptitle('Impact of UNIT PRICE on Volume for Top 321 Products', fontsize = 30)
Text(0.5, 0.98, 'Impact of UNIT PRICE on Volume for Top 321 Products')
<Figure size 2000x1000 with 0 Axes>
To analyze the graph more clearly, let's remove the dots in the background and focus on just the trend lines.
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.lmplot(data=df3, x="UNIT_PRICE", y="STD_PHYSICAL_VOL",scatter=False,
hue='PRODUCT_ID_BLINDED', height=10, aspect=2.0, scatter_kws={'alpha':0.05},
line_kws={'linewidth':3}, legend=False, ci=None)
plt.xlabel("UNIT_PRICE", fontsize = 20)
plt.ylabel("STD_PHYSICAL_VOL", fontsize = 20)
plt.xticks(fontsize= 24)
plt.yticks(fontsize= 24)
plt.xlim(0,100)
plt.ylim(-2000,15000)
plt.suptitle('Impact of UNIT PRICE on Volume for Top 321 Products', fontsize = 30)
Text(0.5, 0.98, 'Impact of UNIT PRICE on Volume for Top 321 Products')
<Figure size 2000x1000 with 0 Axes>
We can see that an increase in price is NOT always related to a decline in sales. For some products we see an upward trend, while for others we see a downward trend.
Hence, Hypothesis 1 [Higher unit price will result in lower sales volume] is NOT true for all the products.
Let's see how 'Number of Stores' and 'Number of Stores Selling' are related.
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.scatterplot(data=df3, x="NUMBER_OF_STORES", y="NUMBER_OF_STORES_SELLING", alpha=0.1, color='red')
#sns.regplot(data=df2, x="NUMBER_OF_STORES", y="STD_PHYSICAL_VOL",fit_reg=True, scatter=False, color='black',line_kws={'linewidth':5})
p.set_xlabel("Number of Stores", fontsize = 20)
p.set_ylabel("Number of Stores Selling", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.suptitle('Relationship between Number of Stores and Number of Stores Selling', fontsize = 26)
plt.title('Hypothesis: Higher number of stores would imply higher number of stores selling', fontsize = 20)
Text(0.5, 1.0, 'Hypothesis: Higher number of stores would imply higher number of stores selling')
'Number of stores selling' is not really correlated with 'Number of Stores'.
However, we do see an upper cap on 'Number of stores selling'. This upper cap is 'Number of Stores'.
Hence, Hypothesis 4 [Higher number of stores would always mean higher number of stores selling] appears to be false.
Let's see if there's high variation in 'Number of Stores' over time.
# create data for the plot
df3_market_num_of_stores= df3[['MARKET_ID_BLINDED','WEEK_ENDING_FRIDAY','NUMBER_OF_STORES']].drop_duplicates() \
.sort_values(by=['MARKET_ID_BLINDED','WEEK_ENDING_FRIDAY'])
df3_market_num_of_stores
| MARKET_ID_BLINDED | WEEK_ENDING_FRIDAY | NUMBER_OF_STORES | |
|---|---|---|---|
| 2115 | M00017 | 2021-01-01 | 2,464.00 |
| 2133 | M00017 | 2021-01-08 | 2,464.00 |
| 2151 | M00017 | 2021-01-15 | 2,464.00 |
| 2169 | M00017 | 2021-01-22 | 2,464.00 |
| 2187 | M00017 | 2021-01-29 | 2,465.00 |
| ... | ... | ... | ... |
| 2206 | M01519 | 2022-09-23 | 25.00 |
| 2224 | M01519 | 2022-09-30 | 25.00 |
| 2242 | M01519 | 2022-10-07 | 25.00 |
| 2260 | M01519 | 2022-10-14 | 25.00 |
| 2278 | M01519 | 2022-10-21 | 25.00 |
1900 rows × 3 columns
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.lineplot(x=df3_market_num_of_stores['WEEK_ENDING_FRIDAY'],y=df3_market_num_of_stores['NUMBER_OF_STORES']
,hue=df3_market_num_of_stores["MARKET_ID_BLINDED"],legend=True, linewidth=3)
p.set_xlabel("Week", fontsize = 20)
p.set_ylabel("Number of Stores", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.suptitle('Number of stores over time for each market', fontsize = 26)
plt.title('Number of stores for each market shows only a small variation over time', fontsize = 14)
plt.legend(fontsize=14,loc='right')
<matplotlib.legend.Legend at 0x7f8be5546c10>
The number of stores is pretty much constant and shows only a little variation over time.
Does a higher number of stores relate to higher sales volume?
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.scatterplot(data=df3, x="NUMBER_OF_STORES", y="STD_PHYSICAL_VOL", alpha=0.1, color='red')
p.set_xlabel("Number of Stores", fontsize = 20)
p.set_ylabel("Volume", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.suptitle('Impact of Number of Stores on Volume', fontsize = 26)
plt.title('Hypothesis: Higher number of stores will lead to higher sales volume', fontsize = 20)
Text(0.5, 1.0, 'Hypothesis: Higher number of stores will lead to higher sales volume')
No, a higher number of stores does not relate to higher sales volume.
Hence, Hypothesis 5 [Higher number of stores would result in higher sales volume] appears to be false.
Let's see if there's high variation in 'Number of Stores Selling' over time.
# prepare the data for the plot
df3_market_num_of_stores_selling= df3[['MARKET_ID_BLINDED','WEEK_ENDING_FRIDAY','NUMBER_OF_STORES_SELLING']].drop_duplicates() \
.sort_values(by=['MARKET_ID_BLINDED','WEEK_ENDING_FRIDAY'])
df3_market_num_of_stores_selling
| MARKET_ID_BLINDED | WEEK_ENDING_FRIDAY | NUMBER_OF_STORES_SELLING | |
|---|---|---|---|
| 2115 | M00017 | 2021-01-01 | 1,442.30 |
| 5553 | M00017 | 2021-01-01 | 818.07 |
| 10143 | M00017 | 2021-01-01 | 1,682.52 |
| 11871 | M00017 | 2021-01-01 | 1,677.30 |
| 16551 | M00017 | 2021-01-01 | 1,444.23 |
| ... | ... | ... | ... |
| 198730 | M01519 | 2022-10-21 | 9.00 |
| 266284 | M01519 | 2022-10-21 | 7.00 |
| 269398 | M01519 | 2022-10-21 | 5.00 |
| 332830 | M01519 | 2022-10-21 | 8.00 |
| 340768 | M01519 | 2022-10-21 | 14.00 |
277161 rows × 3 columns
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.lineplot(x=df3_market_num_of_stores_selling['WEEK_ENDING_FRIDAY'],y=df3_market_num_of_stores_selling['NUMBER_OF_STORES_SELLING']
,hue=df3_market_num_of_stores_selling["MARKET_ID_BLINDED"],legend=True, linewidth=3)
p.set_xlabel("Week", fontsize = 20)
p.set_ylabel("Number of Stores Selling", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.suptitle('Number of stores selling over time for each market', fontsize = 26)
plt.title('Number of stores selling for each market shows a large variation over time', fontsize = 14)
plt.legend(fontsize=14,loc='right')
<matplotlib.legend.Legend at 0x7f8be579bdf0>
The number of stores selling shows large variation over time.
A possible reason for this variation is product delivery to these stores: the same product may not always be available at a store at all times.
Can a higher number of stores selling lead to higher sales volume?
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.scatterplot(data=df3, x="NUMBER_OF_STORES_SELLING", y="STD_PHYSICAL_VOL", alpha=0.1, color='red')
sns.regplot(data=df2, x="NUMBER_OF_STORES_SELLING", y="STD_PHYSICAL_VOL",fit_reg=True, scatter=False, color='black',line_kws={'linewidth':5}) # overall trend line, fitted on the full df2 data
p.set_xlabel("Number of Stores Selling", fontsize = 20)
p.set_ylabel("Volume", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.ylim(0,3500)
plt.xlim(0,2500)
plt.suptitle('Impact of Number of Stores Selling on Volume', fontsize = 26)
plt.title('Hypothesis 6: Higher number of stores selling will lead to higher sales volume', fontsize = 20)
Text(0.5, 1.0, 'Hypothesis 6: Higher number of stores selling will lead to higher sales volume')
At a high level, it looks like a higher number of stores selling leads to higher sales volume, but let's see if this is true for all the markets.
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.lmplot(data=df3, x="NUMBER_OF_STORES_SELLING", y="STD_PHYSICAL_VOL",scatter=True,
hue='MARKET_ID_BLINDED', height=10, aspect=2.0, scatter_kws={'alpha':0.05},
line_kws={'linewidth':5}, legend=False, ci=None)
plt.xlabel("Number of Stores Selling", fontsize = 20)
plt.ylabel("Volume", fontsize = 20)
plt.xticks(fontsize= 24)
plt.yticks(fontsize= 24)
plt.ylim(0,3500)
plt.xlim(0,2500)
plt.suptitle('Impact of Number of Stores Selling on Volume', fontsize = 26)
plt.title('Hypothesis 6: Higher number of stores selling will lead to higher sales volume', fontsize = 20)
plt.legend(fontsize=14,loc='right')
<matplotlib.legend.Legend at 0x7f8c088fa760>
<Figure size 2000x1000 with 0 Axes>
To analyze the graph more clearly, let's remove the dots in the background and focus on just the trend lines.
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
#p=sns.scatterplot(data=df2, x="UNIT_PRICE", y="STD_PHYSICAL_VOL", alpha=0.1, color='red')
p=sns.lmplot(data=df3, x="NUMBER_OF_STORES_SELLING", y="STD_PHYSICAL_VOL",scatter=False,
hue='MARKET_ID_BLINDED', height=10, aspect=2.0, scatter_kws={'alpha':0.05},
line_kws={'linewidth':5}, legend=False, ci=None)
plt.xlabel("Number of Stores Selling", fontsize = 20)
plt.ylabel("Volume", fontsize = 20)
plt.xticks(fontsize= 24)
plt.yticks(fontsize= 24)
plt.ylim(0,3500)
plt.xlim(0,2500)
plt.suptitle('Impact of Number of Stores Selling on Volume', fontsize = 26)
plt.title('Hypothesis 6: Higher number of stores selling will lead to higher sales volume', fontsize = 20)
plt.legend(fontsize=14,loc='right')
<matplotlib.legend.Legend at 0x7f8be574b340>
<Figure size 2000x1000 with 0 Axes>
It's interesting to see how certain markets have high-velocity stores, where fewer stores sell a much higher volume [compared to other markets with low-velocity stores].
Hypothesis 6 [Higher number of stores selling would result in higher sales volume] appears to be true.
print("Number of Rows =",df.shape[0])
print("Number of Columns =",df.shape[1])
Number of Rows = 853321 Number of Columns = 10
# descriptive stats for categorical columns
df.describe(include='object')
| MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | |
|---|---|---|
| count | 853321 | 853321 |
| unique | 20 | 1827 |
| top | M00207 | P000155337256 |
| freq | 71945 | 1900 |
There are 20 unique markets and 1,827 unique products.
print("Total sales Volume =", round(df['STD_PHYSICAL_VOL'].sum(),0))
Total sales Volume = 214759759.0
print("Number of Rows =",df3.shape[0])
print("Number of Columns =",df3.shape[1])
Number of Rows = 414819 Number of Columns = 10
# descriptive stats for categorical columns
df3.describe(include='object')
| MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | |
|---|---|---|
| count | 414819 | 414819 |
| unique | 20 | 321 |
| top | M00211 | P001358992242 |
| freq | 29644 | 1900 |
There are 20 unique markets and 321 unique products.
print("Total sales Volume =", round(df3['STD_PHYSICAL_VOL'].sum(),0))
Total sales Volume = 196537981.0
round(df3['STD_PHYSICAL_VOL'].sum(),0)/round(df['STD_PHYSICAL_VOL'].sum(),0)
0.9151527358530888
20 unique markets and 321 unique products represent 91.5% of sales
Hypothesis 1: Higher unit price will result in lower sales volume
1) True at overall level
2) True for all the markets
3) True for some products and False for some products
Hypothesis 2: A few high volume markets would account for significant portion of sales : True
Hypothesis 3: A few high volume products would account for significant portion of sales: True
Hypothesis 4: Higher number of stores would always mean higher number of stores selling: False
Hypothesis 5: Higher number of stores would result in higher sales volume: False
Hypothesis 6: Higher number of stores selling would result in higher sales volume : True
1) 9 out of 20 markets account for 91% of the volume
2) Some markets are more price elastic than others
3) 17.6 % of products [321 out of 1,827] account for 91.5% of the volume
4) Sales of some products are positively correlated with price, while for other products they are negatively correlated
5) 'Number of stores' shows only a little variation over time
6) 'Number of stores selling' shows large variation over time
7) Some markets are high velocity markets while some are low velocity markets
8) 20 unique markets and 321 unique products represent 91.5% of sales. This is our analysis universe for any future analysis.
9) At this point, my plan is to retain the higher outlier values in the model instead of getting rid of them. These higher outlier values account for a significant % of sales.
We know two facts below:
1) some markets are more price elastic than others, and
2) sales of some products are positively correlated with price while for other products they are negatively correlated
Recommendations coming out of the model could exploit this information and push for higher price points in certain geographies. These geographies may have a high concentration of people of a certain race or ethnicity, which may result in discriminatory price points.
Now that we have established that 1) Unit price and 2) Number of stores selling have a meaningful relationship with sales volume, we can proceed with the modeling task to determine this relationship for every product in every market.
The grain of product and market may be too granular to fit a perfect model [we will see this as we proceed], but I do not have enough information to roll data up to some higher grain.
Here are the steps I would like to follow to determine the relationships:
1) Determine Seasonality for every market - Using polynomial regression and some custom processing
2) Remove the seasonality from the volume to determine sales that are likely affected only by 1) Unit Price and 2) Number of stores selling. Let's call this Adjusted Sales.
3) Determine relationship of Adjusted Sales with 1) Unit Price and 2) number of stores selling [for each product in each market]. Here I am going to use the following two modeling techniques
a) Linear Regression
b) Random Forest
4) Establish which modeling technique performs better
5) Generate insights on 'Price Elasticity' from the models
6) Build an **optimization/simulator engine** for price elasticity
**Note on step 6:** I have kept the task of building the price elasticity simulator as optional for this phase. It is a separate project in itself. I have mentioned it here to show the vision of how these models can be made operational to add value to the business. I might be able to build a simple simulator in excel before the final presentation.
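As a sketch of step 2, assuming multiplicative seasonality (the exact adjustment applied later may differ), dividing the observed volume by the weekly seasonality factor recovers the underlying baseline. The weekly factors below are hypothetical:

```python
import numpy as np

# hypothetical weekly volumes: a flat baseline of 100 scaled by a seasonal multiplier
seasonality_factor = np.array([0.8, 1.0, 1.2, 1.0])   # pattern repeats every 4 weeks
observed_vol = 100 * np.tile(seasonality_factor, 3)   # 12 weeks of observed sales

# multiplicative de-seasonalization: divide out the seasonal factor
adjusted_vol = observed_vol / np.tile(seasonality_factor, 3)
print(adjusted_vol)  # every week collapses back to the 100 baseline
```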
Before we start executing our models, I have declared some functions that I will be using throughout the modeling process.
Declaring functions is a fundamental aspect of programming that can make code more efficient, easier to maintain, and less error-prone.
The two functions declared below are used together [one calls the other].
The model_fit function fits a polynomial regression of a specified degree. Polynomial regression is a type of regression analysis in which the relationship between a dependent variable and one or more independent variables is modeled as an nth-degree polynomial. In other words, instead of fitting a straight line to the data, polynomial regression uses a polynomial function to fit a curve. I am using polynomial regression to smooth the seasonality curves. The function tries out polynomial regressions of degrees 1 to 10.
The adjR function returns the adjusted R-squared value for a fitted polynomial model of a given degree. This helps pick the best degree for the polynomial regression.
def adjR(x, y, degree):
    # fit an nth-degree polynomial and compute its adjusted R-squared
    coeffs = np.polyfit(x, y, degree)
    p = np.poly1d(coeffs)
    yhat = p(x)
    ybar = np.sum(y) / len(y)
    ssreg = np.sum((yhat - ybar) ** 2)
    sstot = np.sum((y - ybar) ** 2)
    # adjusted R-sq = 1 - (1 - R-sq) * (n - 1) / (n - degree - 1)
    return 1 - (((1 - (ssreg / sstot)) * (len(y) - 1)) / (len(y) - degree - 1))
def model_fit(df, num_of_years, col_year, col_week, col_vol):
    pred_vol = []
    for y in range(1, num_of_years + 1):
        print('Year : ' + str(y))
        df_year = df[df[col_year] == y]
        # choose the best polynomial degree (1 to 10) by adjusted R-squared,
        # evaluated on the full series
        list_r2 = []
        for r in range(1, 11):
            list_r2.append(adjR(df[col_week], df[col_vol], r))
        best_degree = list_r2.index(max(list_r2)) + 1
        print('Best Degree : ' + str(best_degree))
        # fit a polynomial of the chosen degree on this year's data only
        model_yr = np.poly1d(np.polyfit(df_year[col_week], df_year[col_vol], best_degree))
        pred_yr = model_yr(df_year[col_week]).tolist()
        pred_vol = pred_vol + pred_yr
    return pred_vol
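As a quick sanity check of the adjR formula: a polynomial of matching degree fit to noise-free data reproduces it exactly, so adjusted R-squared should come out as 1. A small self-contained check that mirrors adjR's logic:

```python
import numpy as np

# noise-free quadratic data: a degree-2 polynomial fit is exact
x = np.array([0., 1., 2., 3., 4., 5.])
y = 2 * x**2 - 3 * x + 1
degree = 2

coeffs = np.polyfit(x, y, degree)
yhat = np.poly1d(coeffs)(x)
ybar = y.mean()
ssreg = np.sum((yhat - ybar) ** 2)
sstot = np.sum((y - ybar) ** 2)
# adjusted R-sq = 1 - (1 - R-sq) * (n - 1) / (n - degree - 1)
r2_adj = 1 - ((1 - ssreg / sstot) * (len(y) - 1)) / (len(y) - degree - 1)
print(round(r2_adj, 6))  # → 1.0
```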
This function determines seasonality for week 53 as average of weeks 51, 52, 1 and 2.
def func_add_week_53_seasonality(df):
    # note: relies on the global variable `market` set by the calling loop
    week_51 = float(df.loc[df['WEEK_OF_THE_YEAR'] == 51, 'WEEKLY_SEASONALITY_FACTOR'])
    week_52 = float(df.loc[df['WEEK_OF_THE_YEAR'] == 52, 'WEEKLY_SEASONALITY_FACTOR'])
    week_01 = float(df.loc[df['WEEK_OF_THE_YEAR'] == 1, 'WEEKLY_SEASONALITY_FACTOR'])
    week_02 = float(df.loc[df['WEEK_OF_THE_YEAR'] == 2, 'WEEKLY_SEASONALITY_FACTOR'])
    week_53 = (week_51 + week_52 + week_01 + week_02) / 4
    data = {'MARKET_ID_BLINDED': market, 'WEEK_OF_THE_YEAR': 53, 'WEEKLY_SEASONALITY_FACTOR': week_53}
    df_53 = pd.DataFrame(data, index=[0])
    df = pd.concat([df, df_53], axis=0)
    df = df.reset_index(drop=True)
    return df
The function below visualizes the seasonality plots for every market. The function creates plots that show four stages of data processing before we arrive at the final seasonality numbers.
This function is called iteratively in a loop [once for each market] and saves the plots on disk for visual examination.
def func_plot_seasonality(df_yearly_weekly,df_weekly,market):
# set size of the figure
fig = plt.figure(figsize=(20,10))
# add title to the plot
plt.suptitle(market, fontsize = 22)
# calculate y-max for plots using sales volume
y_max_1= df_yearly_weekly['STD_PHYSICAL_VOL'].max()*1.1
# calculate y-max for plots using seasonality
y_max_2= df_yearly_weekly['YEARLY_WEEKLY_SEASONALITY_FACTOR'].max()*1.1
# plot 1: trend of volume as is
ax = fig.add_subplot(2, 2, 1)
sns.lineplot(x='WEEK_OF_THE_YEAR', y='STD_PHYSICAL_VOL', hue='YEAR_NUMBER',
data=df_yearly_weekly,palette=['blue','red'],linewidth=2,ax=ax)
ax.set_xlim(0,54)
ax.set_ylim(0,y_max_1)
ax.set_xlabel("Week of the year", fontsize = 14)
ax.set_ylabel("Sales Volume", fontsize = 14)
# plot 2: trend of smoothened volume and baseline
ax = fig.add_subplot(2, 2, 2)
sns.lineplot(x='WEEK_OF_THE_YEAR', y='FITTED_STD_PHYSICAL_VOL', hue='YEAR_NUMBER',
data=df_yearly_weekly,palette=['blue','red'],linewidth=2,ax=ax)
sns.lineplot(x='WEEK_OF_THE_YEAR', y='ANNUAL_BASELINE_FITTED_STD_PHYSICAL_VOL', hue='YEAR_NUMBER',
data=df_yearly_weekly,palette=['blue','red'],linewidth=1,linestyle="dashed",ax=ax)
ax.set_xlim(0,54)
ax.set_ylim(0,y_max_1)
ax.set_xlabel("Week of the year", fontsize = 14)
ax.set_ylabel("Smoothened Sales Volume", fontsize = 14)
# plot 3: trend of weekly seasonality for each year
ax = fig.add_subplot(2, 2, 3)
sns.lineplot(x='WEEK_OF_THE_YEAR', y='YEARLY_WEEKLY_SEASONALITY_FACTOR', hue='YEAR_NUMBER',
data=df_yearly_weekly,palette=['blue','red'],linewidth=2, ax=ax)
ax.set_xlim(0,54)
ax.set_ylim(0,y_max_2)
ax.set_xlabel("Week of the year", fontsize = 14)
ax.set_ylabel("Scaled Sales Volume wrt. Annual Baseline", fontsize = 14)
# plot 4: trend of average weekly seasonality
ax = fig.add_subplot(2, 2, 4)
sns.lineplot(x='WEEK_OF_THE_YEAR', y='WEEKLY_SEASONALITY_FACTOR',
data=df_weekly,color='black',linewidth=2, ax=ax)
ax.set_xlim(0,54)
ax.set_ylim(0,y_max_2)
ax.set_xlabel("Week of the year", fontsize = 14)
ax.set_ylabel("Weekly Seasonality", fontsize = 14)
# save the plot on the disk for visual examination
plt.savefig('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_' + market + '.jpg')
print('Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_' + market + '.jpg')
plt.close()
# return the plot
return fig
The function below visualizes the adjusted sales, i.e. sales after seasonality has been removed from the actual sales.
The function generates plots for every combination of product and market.
Since we have 5,000+ combinations of products and markets, I call this function on a random sample of 100 combinations.
The function provides visual validation of the results.
def func_plot_seasonality_adjusted_sales(df,market,product):
# filter and sort data for one product and one market
df_this_product_this_market=df[(df['MARKET_ID_BLINDED']==market) & (df['PRODUCT_ID_BLINDED']==product)]
df_this_product_this_market=df_this_product_this_market.sort_values('WEEK_ENDING_FRIDAY')
df_this_product_this_market
# create x and y series
x=df_this_product_this_market['WEEK_NUMBER']
y=df_this_product_this_market['STD_PHYSICAL_VOL']
y_adj=df_this_product_this_market['ADJUSTED_STD_PHYSICAL_VOL']
# declare plot
fig = plt.figure(figsize=(20,10))
plt.suptitle('Market='+ market + ' || ' + 'Product='+ product, fontsize = 22)
# plot 1: only show actuals
ax = fig.add_subplot(2, 1, 1)
ax.scatter(x, y, marker='o', color='red',s=25)
ax.plot(x, y, linestyle='-', linewidth=2, color='red',label='Actual')
ax.set_xlabel("Week #", fontsize = 20)
ax.set_ylabel("Sales Volume", fontsize = 20)
ax.tick_params(axis='both', which='major', labelsize=16)
ax.legend(fontsize=20)
#================
# plot 2: show actuals and adjusted
ax = fig.add_subplot(2, 1, 2)
ax.scatter(x, y, marker='o', color='red',s=25, alpha=0.2)
ax.plot(x, y, linestyle='-', linewidth=4, color='red', alpha=0.2, label='Actual')
ax.scatter(x, y_adj, marker='o', color='red',s=25)
ax.plot(x, y_adj, linestyle='--', linewidth=2, color='red', label='Adjusted')
ax.set_xlabel("Week #", fontsize = 20)
ax.set_ylabel("Sales Volume", fontsize = 20)
ax.tick_params(axis='both', which='major', labelsize=16)
ax.legend(fontsize=20)
# save the plot on the disk for visual examination
plt.savefig('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/adjusted_sales_plot_' + market + product + '.jpg')
#print('Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/adjusted_sales_plot_' + market + product +'.jpg')
plt.close()
# return the plot
return fig
Standardizing independent variables can be a useful preprocessing step in modeling that can improve the numerical stability, interpretability, and performance of the model.
def standardize_a_columns(col):
    X = np.array(col)
    X_mean = np.mean(X, axis=0)
    X_std = np.std(X, axis=0)
    # guard against division by zero for constant columns
    X_std = np.where(X_std == 0, 1, X_std)
    return (X - X_mean) / X_std
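A quick usage check: after standardization a column has mean 0 and standard deviation 1, and a constant column passes through as zeros thanks to the zero-std guard. The logic is re-stated below with numpy only so it runs standalone:

```python
import numpy as np

def standardize(col):
    # same logic as standardize_a_columns above
    X = np.array(col, dtype=float)
    X_std = np.where(np.std(X) == 0, 1, np.std(X))
    return (X - np.mean(X)) / X_std

z = standardize([2.0, 4.0, 6.0, 8.0])
print(round(z.mean(), 6), round(z.std(), 6))   # → 0.0 1.0

# constant column: without the guard this would divide by zero
print(standardize([5.0, 5.0, 5.0]))            # → [0. 0. 0.]
```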
This function does several tasks [listed below]:
1) Subsets the data for a given product and market
2) Standardizes the independent variables [by calling the function declared above]
3) Fits a linear model on the training set
4) Validates the model on the test set
5) Collects modeling stats [R-sq, coefficients, p-values, mean, standard deviation, etc.]
This function is called iteratively to fit one model per market and product combination.
def func_build_price_elasticity_lin_model(this_market,this_product):
# subset the training data for this product and this market
df_train_this_iteration= df_train.copy()[(df_train['MARKET_ID_BLINDED']==this_market) \
& (df_train['PRODUCT_ID_BLINDED']==this_product)]
# subset the test data for this product and this market
df_test_this_iteration= df_test.copy()[(df_test['MARKET_ID_BLINDED']==this_market) \
& (df_test['PRODUCT_ID_BLINDED']==this_product)]
#print(X_train.shape[0],'||',X_test.shape[0])
if df_train_this_iteration.shape[0]>=5 and df_test_this_iteration.shape[0]>=5:
## DATA PREPARATION
# standardize 'Unit Price' in training set
df_train_this_iteration.loc[:,'UNIT_PRICE_STANDARDIZED']= \
standardize_a_columns(df_train_this_iteration['UNIT_PRICE'])
# standardize 'Number of Stores Selling' in training set
df_train_this_iteration.loc[:,'NUMBER_OF_STORES_SELLING_STANDARDIZED']= \
standardize_a_columns(df_train_this_iteration['NUMBER_OF_STORES_SELLING'])
# standardize 'Unit Price' in test set
df_test_this_iteration.loc[:,'UNIT_PRICE_STANDARDIZED']= \
standardize_a_columns(df_test_this_iteration['UNIT_PRICE'])
# standardize 'Number of Stores Selling' in test set
df_test_this_iteration.loc[:,'NUMBER_OF_STORES_SELLING_STANDARDIZED']= \
standardize_a_columns(df_test_this_iteration['NUMBER_OF_STORES_SELLING'])
# Create X and Y matrix for training and test sets
X_train=np.array(df_train_this_iteration[['UNIT_PRICE_STANDARDIZED','NUMBER_OF_STORES_SELLING_STANDARDIZED']])
y_train=np.array(df_train_this_iteration[['ADJUSTED_STD_PHYSICAL_VOL']])
X_test=np.array(df_test_this_iteration[['UNIT_PRICE_STANDARDIZED','NUMBER_OF_STORES_SELLING_STANDARDIZED']])
y_test=np.array(df_test_this_iteration[['ADJUSTED_STD_PHYSICAL_VOL']])
## MODEL FIT ON TRAINING SET ##
# add a constant column to the input features
X_train = sm.add_constant(X_train)
# create and fit the linear regression model
model = sm.OLS(y_train, X_train)
results = model.fit()
## PREDICTION ON TEST SET ##
# Add a constant column to the test data
X_test = sm.add_constant(X_test)
# Predict on the test data
y_pred_test = results.predict(X_test)
## COLLECT STATS ##
# Coefficients
coeff_intercept_std= results.params[0]
coeff_unit_price_std= results.params[1]
coeff_number_of_stores_selling_std= results.params[2]
# P-values
p_value_intercept= results.pvalues[0]
p_value_unit_price_std= results.pvalues[1]
p_value_number_of_stores_selling_std= results.pvalues[2]
# R-sq on training set
r2_train= results.rsquared
# R-sq on test set
r2_test = r2_score(y_test, y_pred_test)
# Standard deviations for inverse transformation of coefficients
std_unit_price_original= np.std(df_train_this_iteration['UNIT_PRICE'])
std_number_of_stores_selling_original= np.std(df_train_this_iteration['NUMBER_OF_STORES_SELLING'])
std_adjusted_physical_vol=np.std(df_train_this_iteration['ADJUSTED_STD_PHYSICAL_VOL'])
# Mean for inverse transformation of intercept
mean_unit_price_original= np.mean(df_train_this_iteration['UNIT_PRICE'])
mean_number_of_stores_selling_original= np.mean(df_train_this_iteration['NUMBER_OF_STORES_SELLING'])
mean_adjusted_physical_vol=np.mean(df_train_this_iteration['ADJUSTED_STD_PHYSICAL_VOL'])
## COMBINE ALL STATS IN ONE ROW
stats_this_iteration={
'MARKET_ID_BLINDED' : this_market,
'PRODUCT_ID_BLINDED' : this_product,
'COEFF_INTERCEPT_STD' : coeff_intercept_std,
'COEFF_UNIT_PRICE_STD' : coeff_unit_price_std,
'COEFF_NUMBER_OF_STORES_SELLING_STD' : coeff_number_of_stores_selling_std,
'P_VALUE_INTERCEPT' : p_value_intercept,
'P_VALUE_UNIT_PRICE_STD' : p_value_unit_price_std,
'P_VALUE_NUMBER_OF_STORES_SELLING_STD' : p_value_number_of_stores_selling_std,
'R2_TRAIN' : r2_train,
'R2_TEST' : r2_test,
'STD_UNIT_PRICE_ORIGINAL' : std_unit_price_original,
'STD_NUMBER_OF_STORES_SELLING_ORIGINAL' : std_number_of_stores_selling_original,
'STD_ADJUSTED_PHYSICAL_VOL' : std_adjusted_physical_vol,
'MEAN_UNIT_PRICE_ORIGINAL' : mean_unit_price_original,
'MEAN_NUMBER_OF_STORES_SELLING_ORIGINAL' : mean_number_of_stores_selling_original,
'MEAN_ADJUSTED_PHYSICAL_VOL' : mean_adjusted_physical_vol
}
df_stats_this_iteration = pd.DataFrame(stats_this_iteration, index=[0])
else :
# not enough observations for this market/product pair; signal the caller to skip it
raise ValueError(f'Insufficient data: market={this_market}, product={this_product}')
return df_stats_this_iteration
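The means and standard deviations collected above exist so the standardized coefficients can be back-transformed later. Since only the predictors were standardized [y was left in original units], the original-scale price slope is the standardized slope divided by the price's standard deviation, and a point elasticity can then be evaluated at the sample means. A minimal sketch with illustrative numbers [the helper names and values are mine, not from the notebook]:

```python
def unstandardize_slope(coeff_std, std_x):
    # X was standardized as (x - mean) / std while y stayed in original units,
    # so the slope on the original price scale is the standardized slope / std_x
    return coeff_std / std_x

def point_elasticity(slope_original, mean_price, mean_vol):
    # point price elasticity of demand evaluated at the sample means: (dQ/dP) * (P / Q)
    return slope_original * mean_price / mean_vol

# illustrative numbers only
slope = unstandardize_slope(coeff_std=-120.0, std_x=2.0)               # -60 units per $1
elasticity = point_elasticity(slope, mean_price=10.0, mean_vol=400.0)
print(slope, elasticity)  # -60.0 -1.5
```

An elasticity below -1 [as in this toy example] would indicate demand that is elastic at the current price point.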
This function does several tasks [listed below]:
1) Subsets the data for a given product and market
2) Standardizes the independent variables [by calling the function declared above]
3) Fits a random forest regression model on the training set
4) Validates the model on the test set
5) Collects modeling stats [R-sq]
This function is called iteratively to fit one model per market and product combination.
The function saves the random forest models [one for each market and product] to disk as pickle files.
def func_build_price_elasticity_rf_model(this_market,this_product):
# subset the training data for this product and this market
df_train_this_iteration= df_train.copy()[(df_train['MARKET_ID_BLINDED']==this_market) \
& (df_train['PRODUCT_ID_BLINDED']==this_product)]
# subset the test data for this product and this market
df_test_this_iteration= df_test.copy()[(df_test['MARKET_ID_BLINDED']==this_market) \
& (df_test['PRODUCT_ID_BLINDED']==this_product)]
#print(X_train.shape[0],'||',X_test.shape[0])
if df_train_this_iteration.shape[0]>=5 and df_test_this_iteration.shape[0]>=5:
## DATA PREPARATION
# standardize 'Unit Price' in training set
df_train_this_iteration.loc[:,'UNIT_PRICE_STANDARDIZED']= \
standardize_a_columns(df_train_this_iteration['UNIT_PRICE'])
# standardize 'Number of Stores Selling' in training set
df_train_this_iteration.loc[:,'NUMBER_OF_STORES_SELLING_STANDARDIZED']= \
standardize_a_columns(df_train_this_iteration['NUMBER_OF_STORES_SELLING'])
# standardize 'Unit Price' in test set
df_test_this_iteration.loc[:,'UNIT_PRICE_STANDARDIZED']= \
standardize_a_columns(df_test_this_iteration['UNIT_PRICE'])
# standardize 'Number of Stores Selling' in test set
df_test_this_iteration.loc[:,'NUMBER_OF_STORES_SELLING_STANDARDIZED']= \
standardize_a_columns(df_test_this_iteration['NUMBER_OF_STORES_SELLING'])
# Create X and Y matrix for training and test sets
X_train=np.array(df_train_this_iteration[['UNIT_PRICE_STANDARDIZED','NUMBER_OF_STORES_SELLING_STANDARDIZED']])
y_train=np.array(df_train_this_iteration[['ADJUSTED_STD_PHYSICAL_VOL']])
X_test=np.array(df_test_this_iteration[['UNIT_PRICE_STANDARDIZED','NUMBER_OF_STORES_SELLING_STANDARDIZED']])
y_test=np.array(df_test_this_iteration[['ADJUSTED_STD_PHYSICAL_VOL']])
## MODEL FIT ON TRAINING SET ##
# Train a random forest regression model
model = RandomForestRegressor()
model.fit(X_train, y_train.ravel())  # ravel y to 1-D to avoid sklearn's DataConversionWarning
## PREDICTION ##
# Predict on the training data
y_pred_train = model.predict(X_train)
# Predict on the test data
y_pred_test = model.predict(X_test)
# R-sq on training set
r2_train = r2_score(y_train, y_pred_train)
# R-sq on test set
r2_test = r2_score(y_test, y_pred_test)
## COMBINE ALL STATS IN ONE ROW
stats_this_iteration={
'MARKET_ID_BLINDED' : this_market,
'PRODUCT_ID_BLINDED' : this_product,
'R2_TRAIN' : r2_train,
'R2_TEST' : r2_test,
}
df_stats_this_iteration = pd.DataFrame(stats_this_iteration, index=[0])
# Save the trained model as a pickle file
with open(f'/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/rf_models/rf_model_{this_market}_{this_product}.pkl', 'wb') as rf:
pickle.dump(model, rf)
else :
# not enough observations for this market/product pair; signal the caller to skip it
raise ValueError(f'Insufficient data: market={this_market}, product={this_product}')
return df_stats_this_iteration
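The iterative driver that calls this function is not shown in this chunk. A hedged sketch of what such a loop might look like [`run_models` and `fake_model` are hypothetical names for illustration; the real loop would iterate over the unique market/product pairs in df_train and call func_build_price_elasticity_rf_model]:

```python
import pandas as pd

def run_models(pairs, model_func):
    """Fit one model per (market, product) pair, skipping pairs with too little data."""
    frames, skipped = [], []
    for market, product in pairs:
        try:
            frames.append(model_func(market, product))
        except ValueError:
            # raised when a pair has too few training or test rows
            skipped.append((market, product))
    stats = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return stats, skipped

# tiny stand-in for func_build_price_elasticity_rf_model, for illustration only
def fake_model(market, product):
    if product == 'P_SPARSE':
        raise ValueError('insufficient data')
    return pd.DataFrame({'MARKET_ID_BLINDED': [market],
                         'PRODUCT_ID_BLINDED': [product],
                         'R2_TEST': [0.5]})

stats, skipped = run_models([('M1', 'P1'), ('M1', 'P_SPARSE')], fake_model)
print(len(stats), skipped)  # 1 [('M1', 'P_SPARSE')]
```

Collecting the per-iteration frames and concatenating once keeps the loop robust to sparse pairs while still producing a single stats table.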
A partial dependence plot (PDP) is a graphical representation of the relationship between a target variable and one or more predictor variables in a machine learning model. It shows how the target variable changes as a function of the predictor variable(s) while holding all other variables constant.
This function also saves plots on the disk.
def func_plot_partial_dependence_for_unit_price(this_market,this_product):
# CREATE THE DATASET
this_data=df6.copy()[(df6['MARKET_ID_BLINDED']==this_market) \
& (df6['PRODUCT_ID_BLINDED']==this_product)]
# standardize 'Unit Price' in test set
this_data.loc[:,'UNIT_PRICE_STANDARDIZED']= \
standardize_a_columns(this_data['UNIT_PRICE'])
# standardize 'Number of Stores Selling' in test set
this_data.loc[:,'NUMBER_OF_STORES_SELLING_STANDARDIZED']= \
standardize_a_columns(this_data['NUMBER_OF_STORES_SELLING'])
# just keep the required columns
this_data=this_data.copy()[['ADJUSTED_STD_PHYSICAL_VOL','UNIT_PRICE_STANDARDIZED','NUMBER_OF_STORES_SELLING_STANDARDIZED']]
# LOAD THE MODEL
with open(f'/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/rf_models/rf_model_{this_market}_{this_product}.pkl', 'rb') as rf:
this_model=pickle.load(rf)
# GENERATE PARTIAL DEPENDENCE PLOT
fig, ax = plt.subplots(figsize=(10, 3))
# 'features' [the list of features to plot] is defined earlier in the notebook;
# note: plot_partial_dependence is deprecated in newer scikit-learn versions
# in favor of PartialDependenceDisplay.from_estimator
fig=plot_partial_dependence(this_model, this_data.drop('ADJUSTED_STD_PHYSICAL_VOL', axis=1),
features, target='ADJUSTED_STD_PHYSICAL_VOL', ax=ax)
plt.suptitle('Partial Dependence Plot for Unit Price', fontsize = 12)
plt.title(f'Market ={this_market} || Product = {this_product}' , fontsize = 8)
plt.savefig(f'/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/PDP_Unit_Price_{this_market}_{this_product}.jpg')
Seasonality in modeling refers to the presence of regular, recurring patterns in a time series data set that repeat over a fixed time interval or season.
Here, we will determine annual seasonality at a weekly grain.
For determining seasonality, we need to know the 'Week of the Year'.
Let's have a look at our data processed so far.
df3
| MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | WEEK_ENDING_FRIDAY | PRICE | EQUIVALENT_VOLUME | UNITS | STD_PHYSICAL_VOL | NUMBER_OF_STORES | NUMBER_OF_STORES_SELLING | UNIT_PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M00217 | P002496971635 | 2021-01-01 | 88.23 | 17.25 | 23.00 | 11.50 | 21.00 | 3.00 | 7.67 |
| 1 | M00513 | P002496971635 | 2021-01-01 | 24.28 | 4.50 | 6.00 | 3.00 | 52.00 | 3.00 | 8.09 |
| 8 | M01120 | P006198451610 | 2022-03-11 | 17,803.51 | 1,206.47 | 4,454.68 | 742.45 | 439.42 | 330.48 | 23.98 |
| 10 | M01519 | P002496971635 | 2021-01-01 | 12.97 | 2.25 | 3.00 | 1.50 | 23.00 | 2.00 | 8.65 |
| 12 | M01119 | P002496971635 | 2021-01-01 | 194.20 | 33.00 | 44.00 | 22.00 | 34.00 | 11.00 | 8.83 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 853308 | M00077 | P014842881625 | 2021-05-21 | 547.37 | 22.17 | 266.00 | 11.08 | 44.00 | 34.00 | 49.39 |
| 853309 | M00207 | P012879791534 | 2021-07-16 | 26,253.74 | 2,437.71 | 4,875.42 | 1,625.14 | 566.35 | 481.53 | 16.15 |
| 853312 | M00207 | P016629651440 | 2021-06-18 | 20,049.71 | 3,250.46 | 4,333.95 | 2,166.97 | 564.83 | 405.67 | 9.25 |
| 853318 | M00077 | P014842881625 | 2021-01-22 | 770.37 | 31.17 | 374.00 | 15.58 | 44.00 | 36.00 | 49.44 |
| 853320 | M00207 | P087414971407 | 2022-01-28 | 11,020.34 | 1,565.31 | 10,733.57 | 894.46 | 590.02 | 500.03 | 12.32 |
414819 rows × 10 columns
In the steps below, I am calculating the following 3 new factors:
1) WEEK_NUMBER - A serial number assigned to the week [This does not reset to 1 as the year changes].
2) WEEK_OF_THE_YEAR - A serial number assigned to the week [This resets to 1 as the year changes]. This is used to find corresponding weeks across various years.
3) YEAR_NUMBER - A serial number assigned to the year. [This increments by 1 as the year changes]. We are determining seasonality over a year, and this helps to average out seasonality across multiple years.
# Create list of unique week dates
week_list=df3[['WEEK_ENDING_FRIDAY']].drop_duplicates().sort_values('WEEK_ENDING_FRIDAY')['WEEK_ENDING_FRIDAY'].to_list()
# Calculate week of the year
# Calculate Year Number
Week_of_the_Year=[]
Year_Num=[]
yr=0
for w in week_list:
wk=datetime.strptime(str(w)[0:10],'%Y-%m-%d').date().isocalendar()[1]
Week_of_the_Year.append(wk)
if wk ==1:
yr=yr+1
Year_Num.append(yr)
# combine the lists into pandas dataframe
df_week_mapping=pd.DataFrame({
'WEEK_ENDING_FRIDAY' : week_list,
'WEEK_OF_THE_YEAR' : Week_of_the_Year,
'YEAR_NUMBER' : Year_Num,
})
# also create serial numbers for week [to be used later in graphs]
df_week_mapping=df_week_mapping.reset_index()
df_week_mapping = df_week_mapping.rename(columns={'index': 'WEEK_NUMBER'})
# display the week mapping pandas dataframe
# set the display option to show all rows
pd.set_option('display.max_rows', 100)
df_week_mapping
| WEEK_NUMBER | WEEK_ENDING_FRIDAY | WEEK_OF_THE_YEAR | YEAR_NUMBER | |
|---|---|---|---|---|
| 0 | 0 | 2021-01-01 | 53 | 0 |
| 1 | 1 | 2021-01-08 | 1 | 1 |
| 2 | 2 | 2021-01-15 | 2 | 1 |
| 3 | 3 | 2021-01-22 | 3 | 1 |
| 4 | 4 | 2021-01-29 | 4 | 1 |
| 5 | 5 | 2021-02-05 | 5 | 1 |
| 6 | 6 | 2021-02-12 | 6 | 1 |
| 7 | 7 | 2021-02-19 | 7 | 1 |
| 8 | 8 | 2021-02-26 | 8 | 1 |
| 9 | 9 | 2021-03-05 | 9 | 1 |
| 10 | 10 | 2021-03-12 | 10 | 1 |
| 11 | 11 | 2021-03-19 | 11 | 1 |
| 12 | 12 | 2021-03-26 | 12 | 1 |
| 13 | 13 | 2021-04-02 | 13 | 1 |
| 14 | 14 | 2021-04-09 | 14 | 1 |
| 15 | 15 | 2021-04-16 | 15 | 1 |
| 16 | 16 | 2021-04-23 | 16 | 1 |
| 17 | 17 | 2021-04-30 | 17 | 1 |
| 18 | 18 | 2021-05-07 | 18 | 1 |
| 19 | 19 | 2021-05-14 | 19 | 1 |
| 20 | 20 | 2021-05-21 | 20 | 1 |
| 21 | 21 | 2021-05-28 | 21 | 1 |
| 22 | 22 | 2021-06-04 | 22 | 1 |
| 23 | 23 | 2021-06-11 | 23 | 1 |
| 24 | 24 | 2021-06-18 | 24 | 1 |
| 25 | 25 | 2021-06-25 | 25 | 1 |
| 26 | 26 | 2021-07-02 | 26 | 1 |
| 27 | 27 | 2021-07-09 | 27 | 1 |
| 28 | 28 | 2021-07-16 | 28 | 1 |
| 29 | 29 | 2021-07-23 | 29 | 1 |
| 30 | 30 | 2021-07-30 | 30 | 1 |
| 31 | 31 | 2021-08-06 | 31 | 1 |
| 32 | 32 | 2021-08-13 | 32 | 1 |
| 33 | 33 | 2021-08-20 | 33 | 1 |
| 34 | 34 | 2021-08-27 | 34 | 1 |
| 35 | 35 | 2021-09-03 | 35 | 1 |
| 36 | 36 | 2021-09-10 | 36 | 1 |
| 37 | 37 | 2021-09-17 | 37 | 1 |
| 38 | 38 | 2021-09-24 | 38 | 1 |
| 39 | 39 | 2021-10-01 | 39 | 1 |
| 40 | 40 | 2021-10-08 | 40 | 1 |
| 41 | 41 | 2021-10-15 | 41 | 1 |
| 42 | 42 | 2021-10-22 | 42 | 1 |
| 43 | 43 | 2021-10-29 | 43 | 1 |
| 44 | 44 | 2021-11-05 | 44 | 1 |
| 45 | 45 | 2021-11-12 | 45 | 1 |
| 46 | 46 | 2021-11-19 | 46 | 1 |
| 47 | 47 | 2021-11-26 | 47 | 1 |
| 48 | 48 | 2021-12-03 | 48 | 1 |
| 49 | 49 | 2021-12-10 | 49 | 1 |
| 50 | 50 | 2021-12-17 | 50 | 1 |
| 51 | 51 | 2021-12-24 | 51 | 1 |
| 52 | 52 | 2021-12-31 | 52 | 1 |
| 53 | 53 | 2022-01-07 | 1 | 2 |
| 54 | 54 | 2022-01-14 | 2 | 2 |
| 55 | 55 | 2022-01-21 | 3 | 2 |
| 56 | 56 | 2022-01-28 | 4 | 2 |
| 57 | 57 | 2022-02-04 | 5 | 2 |
| 58 | 58 | 2022-02-11 | 6 | 2 |
| 59 | 59 | 2022-02-18 | 7 | 2 |
| 60 | 60 | 2022-02-25 | 8 | 2 |
| 61 | 61 | 2022-03-04 | 9 | 2 |
| 62 | 62 | 2022-03-11 | 10 | 2 |
| 63 | 63 | 2022-03-18 | 11 | 2 |
| 64 | 64 | 2022-03-25 | 12 | 2 |
| 65 | 65 | 2022-04-01 | 13 | 2 |
| 66 | 66 | 2022-04-08 | 14 | 2 |
| 67 | 67 | 2022-04-15 | 15 | 2 |
| 68 | 68 | 2022-04-22 | 16 | 2 |
| 69 | 69 | 2022-04-29 | 17 | 2 |
| 70 | 70 | 2022-05-06 | 18 | 2 |
| 71 | 71 | 2022-05-13 | 19 | 2 |
| 72 | 72 | 2022-05-20 | 20 | 2 |
| 73 | 73 | 2022-05-27 | 21 | 2 |
| 74 | 74 | 2022-06-03 | 22 | 2 |
| 75 | 75 | 2022-06-10 | 23 | 2 |
| 76 | 76 | 2022-06-17 | 24 | 2 |
| 77 | 77 | 2022-06-24 | 25 | 2 |
| 78 | 78 | 2022-07-01 | 26 | 2 |
| 79 | 79 | 2022-07-08 | 27 | 2 |
| 80 | 80 | 2022-07-15 | 28 | 2 |
| 81 | 81 | 2022-07-22 | 29 | 2 |
| 82 | 82 | 2022-07-29 | 30 | 2 |
| 83 | 83 | 2022-08-05 | 31 | 2 |
| 84 | 84 | 2022-08-12 | 32 | 2 |
| 85 | 85 | 2022-08-19 | 33 | 2 |
| 86 | 86 | 2022-08-26 | 34 | 2 |
| 87 | 87 | 2022-09-02 | 35 | 2 |
| 88 | 88 | 2022-09-09 | 36 | 2 |
| 89 | 89 | 2022-09-16 | 37 | 2 |
| 90 | 90 | 2022-09-23 | 38 | 2 |
| 91 | 91 | 2022-09-30 | 39 | 2 |
| 92 | 92 | 2022-10-07 | 40 | 2 |
| 93 | 93 | 2022-10-14 | 41 | 2 |
| 94 | 94 | 2022-10-21 | 42 | 2 |
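As a side note, the same week-of-the-year values can be computed in vectorized pandas form: `Series.dt.isocalendar()` returns the ISO year/week/day for each date, matching the mapping table above [small self-contained example with sample dates]:

```python
import pandas as pd

# sample of weekly Friday dates, mirroring WEEK_ENDING_FRIDAY
weeks = pd.Series(pd.to_datetime(['2021-01-01', '2021-01-08', '2022-01-07']))
iso = weeks.dt.isocalendar()              # DataFrame with year, week, day columns
week_of_year = iso['week'].astype(int).tolist()
print(week_of_year)  # [53, 1, 1] -- 2021-01-01 falls in ISO week 53 of 2020
```

This also explains the odd first row of the mapping table: 2021-01-01 belongs to ISO week 53 of the previous year.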
# combine week mapping with the base dataset
df4=pd.merge(left=df3,right=df_week_mapping, on ='WEEK_ENDING_FRIDAY', how='left')
df4
| MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | WEEK_ENDING_FRIDAY | PRICE | EQUIVALENT_VOLUME | UNITS | STD_PHYSICAL_VOL | NUMBER_OF_STORES | NUMBER_OF_STORES_SELLING | UNIT_PRICE | WEEK_NUMBER | WEEK_OF_THE_YEAR | YEAR_NUMBER | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M00217 | P002496971635 | 2021-01-01 | 88.23 | 17.25 | 23.00 | 11.50 | 21.00 | 3.00 | 7.67 | 0 | 53 | 0 |
| 1 | M00513 | P002496971635 | 2021-01-01 | 24.28 | 4.50 | 6.00 | 3.00 | 52.00 | 3.00 | 8.09 | 0 | 53 | 0 |
| 2 | M01120 | P006198451610 | 2022-03-11 | 17,803.51 | 1,206.47 | 4,454.68 | 742.45 | 439.42 | 330.48 | 23.98 | 62 | 10 | 2 |
| 3 | M01519 | P002496971635 | 2021-01-01 | 12.97 | 2.25 | 3.00 | 1.50 | 23.00 | 2.00 | 8.65 | 0 | 53 | 0 |
| 4 | M01119 | P002496971635 | 2021-01-01 | 194.20 | 33.00 | 44.00 | 22.00 | 34.00 | 11.00 | 8.83 | 0 | 53 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 414814 | M00077 | P014842881625 | 2021-05-21 | 547.37 | 22.17 | 266.00 | 11.08 | 44.00 | 34.00 | 49.39 | 20 | 20 | 1 |
| 414815 | M00207 | P012879791534 | 2021-07-16 | 26,253.74 | 2,437.71 | 4,875.42 | 1,625.14 | 566.35 | 481.53 | 16.15 | 28 | 28 | 1 |
| 414816 | M00207 | P016629651440 | 2021-06-18 | 20,049.71 | 3,250.46 | 4,333.95 | 2,166.97 | 564.83 | 405.67 | 9.25 | 24 | 24 | 1 |
| 414817 | M00077 | P014842881625 | 2021-01-22 | 770.37 | 31.17 | 374.00 | 15.58 | 44.00 | 36.00 | 49.44 | 3 | 3 | 1 |
| 414818 | M00207 | P087414971407 | 2022-01-28 | 11,020.34 | 1,565.31 | 10,733.57 | 894.46 | 590.02 | 500.03 | 12.32 | 56 | 4 | 2 |
414819 rows × 13 columns
Seasonality in modeling refers to the presence of regular, recurring patterns in a time series data set that repeat over a fixed time interval or season. Seasonality can be observed in a variety of data sets, such as daily, weekly, monthly, quarterly, or annual data.
Seasonality can be caused by a variety of factors, such as weather patterns, holidays, and cultural or societal trends.
Ignoring seasonality in modeling can lead to inaccurate predictions and biased estimates of model parameters. Therefore, it is important to identify and account for seasonality in time series modeling to improve the accuracy and usefulness of the model.
The general understanding is that seasonality varies by geographical location and is determined at some higher grain. The lower the grain, the more the noise, and we do not want to capture noise in seasonality.
Seasonality represents the regular and recurring patterns in a time series, and these patterns can be used to make predictions about future values. Noise, on the other hand, represents the random fluctuations or variability in the data that cannot be attributed to any known factors or patterns.
If the seasonality component of a model captures noise, it may result in overfitting of the data, meaning that the model fits too closely to the noise in the data and is not able to generalize well to new data. This can lead to poor predictive performance and make it difficult to interpret the model.
The code below checks if we can still use the lower grain of market and product to determine seasonality.
# Keep only the columns required to calculate seasonality
df_seasonality=df4[['MARKET_ID_BLINDED','PRODUCT_ID_BLINDED','WEEK_ENDING_FRIDAY','YEAR_NUMBER','WEEK_OF_THE_YEAR','STD_PHYSICAL_VOL']]
# sort the dataframe by market,product, year week
df_seasonality= df_seasonality.sort_values(by=['MARKET_ID_BLINDED','PRODUCT_ID_BLINDED','WEEK_ENDING_FRIDAY'])
# display sorted dataframe
df_seasonality
| MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | WEEK_ENDING_FRIDAY | YEAR_NUMBER | WEEK_OF_THE_YEAR | STD_PHYSICAL_VOL | |
|---|---|---|---|---|---|---|
| 340698 | M00017 | P000008288918 | 2021-01-29 | 1 | 4 | 0.75 |
| 340702 | M00017 | P000008288918 | 2021-02-05 | 1 | 5 | 57.01 |
| 340705 | M00017 | P000008288918 | 2021-02-12 | 1 | 6 | 284.45 |
| 340708 | M00017 | P000008288918 | 2021-02-19 | 1 | 7 | 448.00 |
| 340711 | M00017 | P000008288918 | 2021-02-26 | 1 | 8 | 323.16 |
| ... | ... | ... | ... | ... | ... | ... |
| 204659 | M01519 | P181710721570 | 2022-09-23 | 2 | 38 | 5.62 |
| 204666 | M01519 | P181710721570 | 2022-09-30 | 2 | 39 | 3.17 |
| 204675 | M01519 | P181710721570 | 2022-10-07 | 2 | 40 | 5.08 |
| 204682 | M01519 | P181710721570 | 2022-10-14 | 2 | 41 | 4.42 |
| 204691 | M01519 | P181710721570 | 2022-10-21 | 2 | 42 | 5.25 |
414819 rows × 6 columns
# Get rid of odd year 0 and week 53
print("Rows Before",df_seasonality.shape[0])
df_seasonality=df_seasonality[(df_seasonality['YEAR_NUMBER']!=0) & (df_seasonality['WEEK_OF_THE_YEAR']!=53)]
print("Rows After",df_seasonality.shape[0])
Rows Before 414819 Rows After 410710
# check to see if all the products in all the markets are sold in all the weeks
df_seasonality.groupby(['MARKET_ID_BLINDED','PRODUCT_ID_BLINDED'],as_index=False) \
.agg(COUNT_WEEKS=pd.NamedAgg(column='WEEK_ENDING_FRIDAY',aggfunc=lambda x: len(x.unique())))
| MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | COUNT_WEEKS | |
|---|---|---|---|
| 0 | M00017 | P000008288918 | 91 |
| 1 | M00017 | P000013210025 | 47 |
| 2 | M00017 | P000023761440 | 94 |
| 3 | M00017 | P000049425151 | 94 |
| 4 | M00017 | P000058340661 | 94 |
| ... | ... | ... | ... |
| 5138 | M01519 | P174215091353 | 76 |
| 5139 | M01519 | P174813311645 | 94 |
| 5140 | M01519 | P175918231397 | 1 |
| 5141 | M01519 | P180916311819 | 10 |
| 5142 | M01519 | P181710721570 | 94 |
5143 rows × 3 columns
We can clearly see that not all products are sold in all the markets every week.
For seasonality, we need data for every week.
This means we will have to calculate seasonality at a higher level of aggregation.
I have decided to calculate seasonality at the market level. This means that the same seasonality factors will apply to all the products being sold within a market.
In the sections below, I am calculating seasonality at the 'market' level.
# aggregate the data to market level
df_seasonality_market= df_seasonality \
.groupby(['MARKET_ID_BLINDED','WEEK_ENDING_FRIDAY','YEAR_NUMBER','WEEK_OF_THE_YEAR'],as_index=False) \
.agg(STD_PHYSICAL_VOL=pd.NamedAgg(column='STD_PHYSICAL_VOL',aggfunc=sum)) \
.sort_values(by=['MARKET_ID_BLINDED','WEEK_ENDING_FRIDAY'])
df_seasonality_market
| MARKET_ID_BLINDED | WEEK_ENDING_FRIDAY | YEAR_NUMBER | WEEK_OF_THE_YEAR | STD_PHYSICAL_VOL | |
|---|---|---|---|---|---|
| 0 | M00017 | 2021-01-08 | 1 | 1 | 163,410.66 |
| 1 | M00017 | 2021-01-15 | 1 | 2 | 167,876.83 |
| 2 | M00017 | 2021-01-22 | 1 | 3 | 159,240.89 |
| 3 | M00017 | 2021-01-29 | 1 | 4 | 148,041.28 |
| 4 | M00017 | 2021-02-05 | 1 | 5 | 167,977.64 |
| ... | ... | ... | ... | ... | ... |
| 1875 | M01519 | 2022-09-23 | 2 | 38 | 4,961.23 |
| 1876 | M01519 | 2022-09-30 | 2 | 39 | 4,354.47 |
| 1877 | M01519 | 2022-10-07 | 2 | 40 | 4,512.73 |
| 1878 | M01519 | 2022-10-14 | 2 | 41 | 4,213.91 |
| 1879 | M01519 | 2022-10-21 | 2 | 42 | 4,499.51 |
1880 rows × 5 columns
# check to see if we have data for all the weeks in all the markets
df_seasonality_market.groupby(['MARKET_ID_BLINDED'],as_index=False) \
.agg(COUNT_WEEKS=pd.NamedAgg(column='WEEK_ENDING_FRIDAY',aggfunc=lambda x: len(x.unique())))
| MARKET_ID_BLINDED | COUNT_WEEKS | |
|---|---|---|
| 0 | M00017 | 94 |
| 1 | M00045 | 94 |
| 2 | M00077 | 94 |
| 3 | M00085 | 94 |
| 4 | M00110 | 94 |
| 5 | M00182 | 94 |
| 6 | M00207 | 94 |
| 7 | M00208 | 94 |
| 8 | M00211 | 94 |
| 9 | M00217 | 94 |
| 10 | M00320 | 94 |
| 11 | M00418 | 94 |
| 12 | M00513 | 94 |
| 13 | M00517 | 94 |
| 14 | M00520 | 94 |
| 15 | M01015 | 94 |
| 16 | M01119 | 94 |
| 17 | M01120 | 94 |
| 18 | M01316 | 94 |
| 19 | M01519 | 94 |
All the markets have an equal count of 94 weeks. This data is good for calculating seasonality at the 'Market Level'.
Let's visually examine the trend of sales across each market. Each series in the graph below is a calendar year.
Note: We were given data for 1 year and 10 months. So, year 2 is not complete.
warnings.filterwarnings('ignore')
market_list = df_seasonality_market['MARKET_ID_BLINDED'].drop_duplicates().to_list()
y_max=df_seasonality_market['STD_PHYSICAL_VOL'].max()*1.1
fig, axes = plt.subplots(nrows = 10, ncols = 2) # axes is a 2d array (10x2)
axes = axes.flatten() # Convert axes to a 1d array of length 20
fig.set_size_inches(15, 50)
for ax, col in zip(axes, market_list):
df_seasonality_this_market=df_seasonality_market[df_seasonality_market['MARKET_ID_BLINDED']==col]
sns.lineplot(x=df_seasonality_this_market['WEEK_OF_THE_YEAR'],y=df_seasonality_this_market['STD_PHYSICAL_VOL']
,hue=df_seasonality_this_market['YEAR_NUMBER'],legend=True, linewidth=2,palette=['blue','red'], ax=ax)
#ax.set_ylim(0,y_max)
ax.set_title(col)
If you notice carefully, you will see the general trend of sales across the two years is in sync. This is true for all 20 markets.
Let's determine the seasonality coefficients for each market.
The code below calculates weekly seasonality coefficients. It runs in a loop and processes one market at a time.
market_list = df_seasonality_market['MARKET_ID_BLINDED'].drop_duplicates().to_list()
# create blank dataset that will store year and week level info [for use in plots]
df_seasonality_market2 = pd.DataFrame()
# create blank dataset that contains weekly seasonality factors for each market
# This is the actual output that should be used for any further modeling
df_weekly_seasonality = pd.DataFrame()
for market in market_list:
print("\n=========================")
print("Market:",market)
# filter the data for one market
df_seasonality_this_market=df_seasonality_market[df_seasonality_market['MARKET_ID_BLINDED']==market]
# get fitted [smoothened] physical volume
df_seasonality_this_market['FITTED_STD_PHYSICAL_VOL']=model_fit(df=df_seasonality_this_market,
num_of_years=2,
col_year='YEAR_NUMBER',
col_week='WEEK_OF_THE_YEAR',
col_vol='STD_PHYSICAL_VOL')
# Find average sales [baseline sales] for each year
df_mean_baseline=pd.DataFrame(df_seasonality_this_market.groupby(['YEAR_NUMBER'],as_index=False) \
.agg(ANNUAL_BASELINE_FITTED_STD_PHYSICAL_VOL=pd.NamedAgg(column='FITTED_STD_PHYSICAL_VOL',aggfunc=lambda x: x.mean())))
df_seasonality_this_market=pd.merge(df_seasonality_this_market,df_mean_baseline,how="inner",on="YEAR_NUMBER")
# Calculate Seasonality Ratio over the average
df_seasonality_this_market['YEARLY_WEEKLY_SEASONALITY_FACTOR']=\
df_seasonality_this_market['FITTED_STD_PHYSICAL_VOL']/df_seasonality_this_market['ANNUAL_BASELINE_FITTED_STD_PHYSICAL_VOL']
# incrementally append to the overall dataset [DataFrame.append is deprecated; use pd.concat]
df_seasonality_market2 = pd.concat([df_seasonality_market2, df_seasonality_this_market])
# calculate weekly seasonality for each market
df_weekly_seasonality_this_market= df_seasonality_this_market \
.groupby(['MARKET_ID_BLINDED','WEEK_OF_THE_YEAR'],as_index=False) \
.agg(WEEKLY_SEASONALITY_FACTOR=pd.NamedAgg(column='YEARLY_WEEKLY_SEASONALITY_FACTOR',aggfunc=lambda x: x.mean()))
# add week 53 seasonality
df_weekly_seasonality_this_market=func_add_week_53_seasonality(df_weekly_seasonality_this_market)
# incrementally append to the overall dataset
df_weekly_seasonality = pd.concat([df_weekly_seasonality, df_weekly_seasonality_this_market])
========================= Market: M00017 Year : 1 Best Degree : 5 Year : 2 Best Degree : 5 ========================= Market: M00045 Year : 1 Best Degree : 4 Year : 2 Best Degree : 4 ========================= Market: M00077 Year : 1 Best Degree : 6 Year : 2 Best Degree : 6 ========================= Market: M00085 Year : 1 Best Degree : 8 Year : 2 Best Degree : 8 ========================= Market: M00110 Year : 1 Best Degree : 4 Year : 2 Best Degree : 4 ========================= Market: M00182 Year : 1 Best Degree : 6 Year : 2 Best Degree : 6 ========================= Market: M00207 Year : 1 Best Degree : 7 Year : 2 Best Degree : 7 ========================= Market: M00208 Year : 1 Best Degree : 4 Year : 2 Best Degree : 4 ========================= Market: M00211 Year : 1 Best Degree : 6 Year : 2 Best Degree : 6 ========================= Market: M00217 Year : 1 Best Degree : 9 Year : 2 Best Degree : 9 ========================= Market: M00320 Year : 1 Best Degree : 7 Year : 2 Best Degree : 7 ========================= Market: M00418 Year : 1 Best Degree : 6 Year : 2 Best Degree : 6 ========================= Market: M00513 Year : 1 Best Degree : 9 Year : 2 Best Degree : 9 ========================= Market: M00517 Year : 1 Best Degree : 4 Year : 2 Best Degree : 4 ========================= Market: M00520 Year : 1 Best Degree : 10 Year : 2 Best Degree : 10 ========================= Market: M01015 Year : 1 Best Degree : 10 Year : 2 Best Degree : 10 ========================= Market: M01119 Year : 1 Best Degree : 10 Year : 2 Best Degree : 10 ========================= Market: M01120 Year : 1 Best Degree : 6 Year : 2 Best Degree : 6 ========================= Market: M01316 Year : 1 Best Degree : 4 Year : 2 Best Degree : 4 ========================= Market: M01519 Year : 1 Best Degree : 8 Year : 2 Best Degree : 8
The data below shows seasonality coefficients broken down by year. This is not our final data, but we had to generate it for visualization purposes [see the graph below in section 6.3.2.4].
df_seasonality_market2
| MARKET_ID_BLINDED | WEEK_ENDING_FRIDAY | YEAR_NUMBER | WEEK_OF_THE_YEAR | STD_PHYSICAL_VOL | FITTED_STD_PHYSICAL_VOL | ANNUAL_BASELINE_FITTED_STD_PHYSICAL_VOL | YEARLY_WEEKLY_SEASONALITY_FACTOR | |
|---|---|---|---|---|---|---|---|---|
| 0 | M00017 | 2021-01-08 | 1 | 1 | 163,410.66 | 162,315.13 | 189,936.17 | 0.85 |
| 1 | M00017 | 2021-01-15 | 1 | 2 | 167,876.83 | 160,185.16 | 189,936.17 | 0.84 |
| 2 | M00017 | 2021-01-22 | 1 | 3 | 159,240.89 | 159,639.96 | 189,936.17 | 0.84 |
| 3 | M00017 | 2021-01-29 | 1 | 4 | 148,041.28 | 160,411.64 | 189,936.17 | 0.84 |
| 4 | M00017 | 2021-02-05 | 1 | 5 | 167,977.64 | 162,255.73 | 189,936.17 | 0.85 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 89 | M01519 | 2022-09-23 | 2 | 38 | 4,961.23 | 4,643.43 | 5,084.18 | 0.91 |
| 90 | M01519 | 2022-09-30 | 2 | 39 | 4,354.47 | 4,534.05 | 5,084.18 | 0.89 |
| 91 | M01519 | 2022-10-07 | 2 | 40 | 4,512.73 | 4,445.26 | 5,084.18 | 0.87 |
| 92 | M01519 | 2022-10-14 | 2 | 41 | 4,213.91 | 4,393.43 | 5,084.18 | 0.86 |
| 93 | M01519 | 2022-10-21 | 2 | 42 | 4,499.51 | 4,404.48 | 5,084.18 | 0.87 |
1880 rows × 8 columns
Please see the dataset below. This is what the final seasonality coefficient dataset looks like.
Here, for every market, we have seasonality factors [coefficients] for every week of the year.
df_weekly_seasonality
| MARKET_ID_BLINDED | WEEK_OF_THE_YEAR | WEEKLY_SEASONALITY_FACTOR | |
|---|---|---|---|
| 0 | M00017 | 1 | 0.84 |
| 1 | M00017 | 2 | 0.84 |
| 2 | M00017 | 3 | 0.84 |
| 3 | M00017 | 4 | 0.85 |
| 4 | M00017 | 5 | 0.86 |
| ... | ... | ... | ... |
| 48 | M01519 | 49 | 0.99 |
| 49 | M01519 | 50 | 1.02 |
| 50 | M01519 | 51 | 1.05 |
| 51 | M01519 | 52 | 1.11 |
| 52 | M01519 | 53 | 1.02 |
1060 rows × 3 columns
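Since each factor is a ratio of fitted weekly volume to the annual baseline, a quick sanity check is that the factors within a market should average to roughly 1.0. A minimal sketch with a toy frame standing in for df_weekly_seasonality [the values are illustrative, not from the data]:

```python
import pandas as pd

# toy stand-in for df_weekly_seasonality: each factor is the ratio of fitted
# weekly volume to the annual baseline, so per market they should average ~1.0
toy = pd.DataFrame({
    'MARKET_ID_BLINDED': ['M1'] * 4,
    'WEEK_OF_THE_YEAR': [1, 2, 3, 4],
    'WEEKLY_SEASONALITY_FACTOR': [0.9, 1.1, 0.95, 1.05],
})
market_means = toy.groupby('MARKET_ID_BLINDED')['WEEKLY_SEASONALITY_FACTOR'].mean()
print(market_means.round(3).to_dict())  # {'M1': 1.0}
```

A market whose mean factor drifts far from 1.0 would suggest a problem in the fitting or averaging step.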
Let's iteratively generate step-wise [one step per phase of processing] seasonality plots for visual examination.
These plots are saved to disk, but I have displayed one plot here in this notebook for explanation.
market_list = df_seasonality_market2['MARKET_ID_BLINDED'].drop_duplicates().to_list()
for market in market_list:
print("\n=========================")
print("Market:",market)
df_seasonality_this_market2=df_seasonality_market2[df_seasonality_market2['MARKET_ID_BLINDED']==market]
df_weekly_seasonality_this_market=df_weekly_seasonality[df_weekly_seasonality['MARKET_ID_BLINDED']==market]
seasonality_plot=func_plot_seasonality(df_yearly_weekly=df_seasonality_this_market2,
df_weekly=df_weekly_seasonality_this_market,
market=market)
========================= Market: M00017 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M00017.jpg ========================= Market: M00045 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M00045.jpg ========================= Market: M00077 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M00077.jpg ========================= Market: M00085 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M00085.jpg ========================= Market: M00110 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M00110.jpg ========================= Market: M00182 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M00182.jpg ========================= Market: M00207 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M00207.jpg ========================= Market: M00208 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M00208.jpg ========================= Market: M00211 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M00211.jpg ========================= Market: M00217 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M00217.jpg ========================= Market: M00320 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA 
CapstoneCompletion/Plots/seasonality_plot_M00320.jpg ========================= Market: M00418 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M00418.jpg ========================= Market: M00513 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M00513.jpg ========================= Market: M00517 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M00517.jpg ========================= Market: M00520 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M00520.jpg ========================= Market: M01015 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M01015.jpg ========================= Market: M01119 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M01119.jpg ========================= Market: M01120 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M01120.jpg ========================= Market: M01316 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M01316.jpg ========================= Market: M01519 Saved the plot: /Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Plots/seasonality_plot_M01519.jpg
# just display the plot for last market here in the notebook
seasonality_plot
The above plot shows four phases of processing [go left-to-right in each row]:
1) Original/raw sales volume [divided by calendar year]
2) Smoothed sales volume [divided by calendar year]. The plot also shows the annual average as dotted lines.
3) Scaled sales volume [scaled by the annual average to bring both years onto the same level].
4) The average of the two lines in plot 3. This is the final seasonality trend.
Here, I am saving the data processed so far to disk so that I don't have to regenerate everything from the beginning if the python notebook session ends.
These lines are commented out for now; I only uncomment them when I resume my work on the notebook.
#df_weekly_seasonality.to_pickle('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Datasets/df_weekly_seasonality.pkl')
#df_seasonality_market2.to_pickle('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Datasets/df_seasonality_market2.pkl')
#df3.to_pickle('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Datasets/df3.pkl')
#df4.to_pickle('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Datasets/df4.pkl')
#df_weekly_seasonality=pd.read_pickle('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Datasets/df_weekly_seasonality.pkl')
#df_seasonality_market2=pd.read_pickle('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Datasets/df_seasonality_market2.pkl')
#df3= pd.read_pickle('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Datasets/df3.pkl')
#df4= pd.read_pickle('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Datasets/df4.pkl')
We should remove seasonality from sales data to determine the true impact of other factors for several reasons:
Eliminating confounding factors: Seasonality can be a confounding factor that masks or exaggerates the effect of other factors on sales. For example, ice cream sales may be higher in the summer due to seasonal effects, and this higher volume could mistakenly be attributed to a promotional campaign. By removing the seasonal effects, we can isolate the impact of the campaign on sales and avoid drawing incorrect conclusions about its effectiveness.
Improving forecasting accuracy: Seasonality can make it difficult to forecast future sales accurately, as it produces patterns that do not reflect the underlying trends. Removing the seasonal effects improves the accuracy of our forecasting models and supports better-informed decisions about inventory management, staffing, and other business operations.
Better understanding of long-term trends: Removing seasonality from sales data can reveal long-term trends that would otherwise be obscured by regular seasonal fluctuations. This gives a more accurate picture of the business's performance over time and helps identify areas for improvement.
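Mechanically, deseasonalizing here means dividing each week's sales by its seasonality factor (a factor of 1.10 marks a week that typically runs 10% above the annual average). A toy example with made-up numbers:

```python
import pandas as pd

sales = pd.DataFrame({
    'WEEK_OF_THE_YEAR': [1, 26, 52],
    'STD_PHYSICAL_VOL': [90.0, 121.0, 95.0],
    'WEEKLY_SEASONALITY_FACTOR': [0.90, 1.10, 0.95],  # <1 below-average week, >1 above-average
})

# dividing by the factor puts every week on a comparable, season-free footing
sales['ADJUSTED_STD_PHYSICAL_VOL'] = (sales['STD_PHYSICAL_VOL']
                                      / sales['WEEKLY_SEASONALITY_FACTOR'])
print(sales['ADJUSTED_STD_PHYSICAL_VOL'].round(6).tolist())  # [100.0, 110.0, 100.0]
```

All three toy weeks turn out to have the same underlying demand level once the seasonal swing is divided out.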
In the steps below, we calculate adjusted sales by removing seasonality from actual sales.
# join base data with seasonality data using market and week as keys
df5= pd.merge(left=df4,right=df_weekly_seasonality,on=['MARKET_ID_BLINDED','WEEK_OF_THE_YEAR'],how='left')
df5
| | MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | WEEK_ENDING_FRIDAY | PRICE | EQUIVALENT_VOLUME | UNITS | STD_PHYSICAL_VOL | NUMBER_OF_STORES | NUMBER_OF_STORES_SELLING | UNIT_PRICE | WEEK_NUMBER | WEEK_OF_THE_YEAR | YEAR_NUMBER | WEEKLY_SEASONALITY_FACTOR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M00217 | P002496971635 | 2021-01-01 | 88.230 | 17.250 | 23.000 | 11.500 | 21.000 | 3.000 | 7.672174 | 0 | 53 | 0 | 1.006741 |
| 1 | M00513 | P002496971635 | 2021-01-01 | 24.280 | 4.500 | 6.000 | 3.000 | 52.000 | 3.000 | 8.093333 | 0 | 53 | 0 | 1.043116 |
| 2 | M01120 | P006198451610 | 2022-03-11 | 17803.505 | 1206.475 | 4454.678 | 742.446 | 439.420 | 330.484 | 23.979528 | 62 | 10 | 2 | 0.903644 |
| 3 | M01519 | P002496971635 | 2021-01-01 | 12.970 | 2.250 | 3.000 | 1.500 | 23.000 | 2.000 | 8.646667 | 0 | 53 | 0 | 1.018548 |
| 4 | M01119 | P002496971635 | 2021-01-01 | 194.200 | 33.000 | 44.000 | 22.000 | 34.000 | 11.000 | 8.827273 | 0 | 53 | 0 | 1.007311 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 414814 | M00077 | P014842881625 | 2021-05-21 | 547.370 | 22.167 | 266.000 | 11.083 | 44.000 | 34.000 | 49.388252 | 20 | 20 | 1 | 0.993541 |
| 414815 | M00207 | P012879791534 | 2021-07-16 | 26253.737 | 2437.712 | 4875.423 | 1625.141 | 566.353 | 481.526 | 16.154744 | 28 | 28 | 1 | 1.047571 |
| 414816 | M00207 | P016629651440 | 2021-06-18 | 20049.709 | 3250.460 | 4333.946 | 2166.973 | 564.827 | 405.670 | 9.252404 | 24 | 24 | 1 | 1.100148 |
| 414817 | M00077 | P014842881625 | 2021-01-22 | 770.370 | 31.167 | 374.000 | 15.583 | 44.000 | 36.000 | 49.436565 | 3 | 3 | 1 | 1.021421 |
| 414818 | M00207 | P087414971407 | 2022-01-28 | 11020.342 | 1565.313 | 10733.572 | 894.464 | 590.018 | 500.030 | 12.320610 | 56 | 4 | 2 | 0.951865 |
414819 rows × 14 columns
# calculate adjusted volume
df5['ADJUSTED_STD_PHYSICAL_VOL'] = df5['STD_PHYSICAL_VOL']/df5['WEEKLY_SEASONALITY_FACTOR']
# keep only necessary columns
df5=df5[['MARKET_ID_BLINDED','PRODUCT_ID_BLINDED','WEEK_ENDING_FRIDAY','WEEK_NUMBER','YEAR_NUMBER','WEEK_OF_THE_YEAR','STD_PHYSICAL_VOL','ADJUSTED_STD_PHYSICAL_VOL','UNIT_PRICE','NUMBER_OF_STORES','NUMBER_OF_STORES_SELLING']]
df5
| | MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | WEEK_ENDING_FRIDAY | WEEK_NUMBER | YEAR_NUMBER | WEEK_OF_THE_YEAR | STD_PHYSICAL_VOL | ADJUSTED_STD_PHYSICAL_VOL | UNIT_PRICE | NUMBER_OF_STORES | NUMBER_OF_STORES_SELLING |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M00217 | P002496971635 | 2021-01-01 | 0 | 0 | 53 | 11.500 | 11.422999 | 7.672174 | 21.000 | 3.000 |
| 1 | M00513 | P002496971635 | 2021-01-01 | 0 | 0 | 53 | 3.000 | 2.875999 | 8.093333 | 52.000 | 3.000 |
| 2 | M01120 | P006198451610 | 2022-03-11 | 62 | 2 | 10 | 742.446 | 821.613609 | 23.979528 | 439.420 | 330.484 |
| 3 | M01519 | P002496971635 | 2021-01-01 | 0 | 0 | 53 | 1.500 | 1.472685 | 8.646667 | 23.000 | 2.000 |
| 4 | M01119 | P002496971635 | 2021-01-01 | 0 | 0 | 53 | 22.000 | 21.840335 | 8.827273 | 34.000 | 11.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 414814 | M00077 | P014842881625 | 2021-05-21 | 20 | 1 | 20 | 11.083 | 11.155050 | 49.388252 | 44.000 | 34.000 |
| 414815 | M00207 | P012879791534 | 2021-07-16 | 28 | 1 | 28 | 1625.141 | 1551.342814 | 16.154744 | 566.353 | 481.526 |
| 414816 | M00207 | P016629651440 | 2021-06-18 | 24 | 1 | 24 | 2166.973 | 1969.710734 | 9.252404 | 564.827 | 405.670 |
| 414817 | M00077 | P014842881625 | 2021-01-22 | 3 | 1 | 3 | 15.583 | 15.256203 | 49.436565 | 44.000 | 36.000 |
| 414818 | M00207 | P087414971407 | 2022-01-28 | 56 | 2 | 4 | 894.464 | 939.696499 | 12.320610 | 590.018 | 500.030 |
414819 rows × 11 columns
The two example plots below show how the 'Adjusted Sales' look after seasonality is removed from the 'Actual Sales'.
func_plot_seasonality_adjusted_sales(df=df5, market='M01519', product='P002311681710')
func_plot_seasonality_adjusted_sales(df=df5, market='M00217', product='P007161407533')
The code below generates 100 random plots [similar to the two examples above] and saves them to disk. Each plot corresponds to one market-product combination; we have 5,000+ combinations of markets and products. The plots allow quick visual examination of the results.
# create dataset with list of unique products and markets
df_unique_market_product=df5[['MARKET_ID_BLINDED','PRODUCT_ID_BLINDED']].drop_duplicates()
df_unique_market_product=df_unique_market_product.reset_index(drop=True)
# sample 100 random row positions (randint's upper bound is inclusive, so subtract 1 to stay in range)
for i in [random.randint(0, df_unique_market_product.shape[0] - 1) for _ in range(100)]:
    market=df_unique_market_product.loc[i,'MARKET_ID_BLINDED']
    product=df_unique_market_product.loc[i,'PRODUCT_ID_BLINDED']
    try:
        adjusted_sales_plot=func_plot_seasonality_adjusted_sales(df=df5, market=market, product=product)
    except Exception:
        print("Plot generation failed for Market =", market, "and Product =", product)
#df5.to_pickle('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Datasets/df5.pkl')
df5= pd.read_pickle('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Datasets/df5.pkl')
Now that we have removed seasonality from the sales, we can proceed to determine the 'true' price elasticity.
df6=df5.copy()[['MARKET_ID_BLINDED', 'PRODUCT_ID_BLINDED', 'WEEK_ENDING_FRIDAY','ADJUSTED_STD_PHYSICAL_VOL', 'UNIT_PRICE',
'NUMBER_OF_STORES_SELLING']]
df6
| | MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | WEEK_ENDING_FRIDAY | ADJUSTED_STD_PHYSICAL_VOL | UNIT_PRICE | NUMBER_OF_STORES_SELLING |
|---|---|---|---|---|---|---|
| 0 | M00217 | P002496971635 | 2021-01-01 | 11.422999 | 7.672174 | 3.000 |
| 1 | M00513 | P002496971635 | 2021-01-01 | 2.875999 | 8.093333 | 3.000 |
| 2 | M01120 | P006198451610 | 2022-03-11 | 821.613609 | 23.979528 | 330.484 |
| 3 | M01519 | P002496971635 | 2021-01-01 | 1.472685 | 8.646667 | 2.000 |
| 4 | M01119 | P002496971635 | 2021-01-01 | 21.840335 | 8.827273 | 11.000 |
| ... | ... | ... | ... | ... | ... | ... |
| 414814 | M00077 | P014842881625 | 2021-05-21 | 11.155050 | 49.388252 | 34.000 |
| 414815 | M00207 | P012879791534 | 2021-07-16 | 1551.342814 | 16.154744 | 481.526 |
| 414816 | M00207 | P016629651440 | 2021-06-18 | 1969.710734 | 9.252404 | 405.670 |
| 414817 | M00077 | P014842881625 | 2021-01-22 | 15.256203 | 49.436565 | 36.000 |
| 414818 | M00207 | P087414971407 | 2022-01-28 | 939.696499 | 12.320610 | 500.030 |
414819 rows × 6 columns
To build the price elasticity models, we will split the data into training and test sets in a 7:3 ratio.
To ensure that the training and test sets proportionally represent our analysis universe, we will use stratified random sampling based on the following features:
1) Market - so that all markets are proportionally represented in both sets.
2) Product - so that all products are proportionally represented in both sets.
3) Month - so that seasonality is proportionally represented in both sets.
Month is not currently available in our dataset, so we will have to create it.
# Create Month
df6.loc[:,'MONTH']= df6['WEEK_ENDING_FRIDAY'].dt.month
df6
| | MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | WEEK_ENDING_FRIDAY | ADJUSTED_STD_PHYSICAL_VOL | UNIT_PRICE | NUMBER_OF_STORES_SELLING | MONTH |
|---|---|---|---|---|---|---|---|
| 0 | M00217 | P002496971635 | 2021-01-01 | 11.422999 | 7.672174 | 3.000 | 1 |
| 1 | M00513 | P002496971635 | 2021-01-01 | 2.875999 | 8.093333 | 3.000 | 1 |
| 2 | M01120 | P006198451610 | 2022-03-11 | 821.613609 | 23.979528 | 330.484 | 3 |
| 3 | M01519 | P002496971635 | 2021-01-01 | 1.472685 | 8.646667 | 2.000 | 1 |
| 4 | M01119 | P002496971635 | 2021-01-01 | 21.840335 | 8.827273 | 11.000 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 414814 | M00077 | P014842881625 | 2021-05-21 | 11.155050 | 49.388252 | 34.000 | 5 |
| 414815 | M00207 | P012879791534 | 2021-07-16 | 1551.342814 | 16.154744 | 481.526 | 7 |
| 414816 | M00207 | P016629651440 | 2021-06-18 | 1969.710734 | 9.252404 | 405.670 | 6 |
| 414817 | M00077 | P014842881625 | 2021-01-22 | 15.256203 | 49.436565 | 36.000 | 1 |
| 414818 | M00207 | P087414971407 | 2022-01-28 | 939.696499 | 12.320610 | 500.030 | 1 |
414819 rows × 7 columns
Stratified random sampling is important for ensuring that the sample used to train or evaluate a model is representative of the population being modeled, and it can help improve the accuracy, efficiency, and fairness of the model.
It involves dividing the population into subgroups or strata based on one or more important variables, and then selecting a random sample from each subgroup proportional to its size in the population.
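A toy sketch of that procedure, using pandas `groupby` + `sample` per stratum and taking the complement as the test set (the `random_state` and the made-up columns are assumptions for the sketch; the notebook applies the same idea to market, product, and month below):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
toy = pd.DataFrame({'MARKET': rng.choice(['M1', 'M2'], size=1000),
                    'MONTH': rng.integers(1, 13, size=1000),
                    'VOL': rng.normal(100, 10, size=1000)})

# stratified 70/30 split: sample 70% within every (market, month) stratum
train = (toy.groupby(['MARKET', 'MONTH'], group_keys=False)
            .apply(lambda g: g.sample(frac=0.7, random_state=42)))
test = toy.drop(index=train.index)  # the complement forms the test set

print(round(len(train) / len(toy), 2))  # close to 0.70 overall
```

Because the 70% draw happens inside each stratum, every (market, month) cell keeps roughly the same share in both sets.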
Now, we can create the training and test sets via stratified random sampling using the strata described in section 6.5.2 above.
Before we do that, let's check how these strata are distributed in the original data.
# display distribution of the base data
pd.options.display.float_format = '{:,.2f}'.format
print('Number of Rows:',df6.shape[0])
print((df6['MARKET_ID_BLINDED'].value_counts()) / len(df6) * 100)
print((df6['PRODUCT_ID_BLINDED'].value_counts()) / len(df6) * 100)
print((df6['MONTH'].value_counts()) / len(df6) * 100)
Number of Rows: 414819
M00211 7.15
M00207 7.04
M00418 6.58
M01120 6.56
M00182 6.55
M00085 4.88
M00513 4.80
M01316 4.79
M00017 4.75
M00217 4.61
M00208 4.52
M00077 4.41
M01015 4.34
M00110 4.31
M01119 4.28
M00320 4.25
M00517 4.25
M00045 4.03
M01519 4.02
M00520 3.90
Name: MARKET_ID_BLINDED, dtype: float64
P001358992242 0.46
P000991711435 0.46
P000155337256 0.46
P014842881625 0.46
P000825392813 0.46
...
P017159931175 0.09
P049512231223 0.08
P001645125473 0.08
P011001708253 0.07
P039015901366 0.02
Name: PRODUCT_ID_BLINDED, Length: 321, dtype: float64
4 10.61
7 10.58
9 9.50
1 9.20
5 8.49
6 8.48
8 8.47
10 8.45
3 8.41
2 8.35
12 5.25
11 4.21
Name: MONTH, dtype: float64
Let's create the training set as 70% of the base data
df_train=df6 \
.groupby(['MARKET_ID_BLINDED','PRODUCT_ID_BLINDED','MONTH'],group_keys=False) \
.apply(lambda x: x.sample(frac=0.7))
df_train
| | MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | WEEK_ENDING_FRIDAY | ADJUSTED_STD_PHYSICAL_VOL | UNIT_PRICE | NUMBER_OF_STORES_SELLING | MONTH |
|---|---|---|---|---|---|---|---|
| 340851 | M00017 | P000008288918 | 2022-01-14 | 495.96 | 57.08 | 1,560.41 | 1 |
| 340848 | M00017 | P000008288918 | 2022-01-07 | 439.44 | 59.58 | 1,504.30 | 1 |
| 340855 | M00017 | P000008288918 | 2022-01-21 | 551.53 | 58.43 | 1,646.25 | 1 |
| 340698 | M00017 | P000008288918 | 2021-01-29 | 0.89 | 61.60 | 7.00 | 1 |
| 340705 | M00017 | P000008288918 | 2021-02-12 | 326.21 | 61.30 | 992.04 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 204241 | M01519 | P181710721570 | 2021-11-26 | 7.20 | 50.46 | 23.00 | 11 |
| 204272 | M01519 | P181710721570 | 2021-12-17 | 8.83 | 52.59 | 24.00 | 12 |
| 204251 | M01519 | P181710721570 | 2021-12-03 | 4.90 | 49.77 | 22.00 | 12 |
| 204292 | M01519 | P181710721570 | 2021-12-31 | 4.78 | 52.34 | 22.00 | 12 |
| 204282 | M01519 | P181710721570 | 2021-12-24 | 5.59 | 52.38 | 21.00 | 12 |
301672 rows × 7 columns
Let's look at the distribution of strata in the training set.
# display distribution of the training data
pd.options.display.float_format = '{:,.2f}'.format
print('Number of Rows:',df_train.shape[0])
print((df_train['MARKET_ID_BLINDED'].value_counts()) / len(df_train) * 100)
print((df_train['PRODUCT_ID_BLINDED'].value_counts()) / len(df_train) * 100)
print((df_train['MONTH'].value_counts()) / len(df_train) * 100)
Number of Rows: 301672
M00211 7.15
M00207 7.04
M00418 6.57
M01120 6.56
M00182 6.55
M00085 4.88
M00513 4.80
M01316 4.79
M00017 4.75
M00217 4.62
M00208 4.51
M00077 4.41
M01015 4.34
M00110 4.31
M01119 4.29
M00320 4.25
M00517 4.24
M01519 4.03
M00045 4.03
M00520 3.89
Name: MARKET_ID_BLINDED, dtype: float64
P011641151483 0.46
P003071623916 0.46
P012589031327 0.46
P012511500276 0.46
P000523252289 0.46
...
P004361528419 0.09
P049512231223 0.08
P001645125473 0.08
P011001708253 0.07
P039015901366 0.02
Name: PRODUCT_ID_BLINDED, Length: 321, dtype: float64
4 10.25
7 10.23
9 8.79
5 8.73
6 8.72
8 8.71
10 8.69
3 8.65
2 8.59
1 8.55
12 5.76
11 4.33
Name: MONTH, dtype: float64
Let's create the test set as the remaining 30% of the base data.
# test set = all rows of the base data that were not sampled into the training set
df_test = df6.drop(index=df_train.index)
df_test
| | MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | WEEK_ENDING_FRIDAY | ADJUSTED_STD_PHYSICAL_VOL | UNIT_PRICE | NUMBER_OF_STORES_SELLING | MONTH |
|---|---|---|---|---|---|---|---|
| 6 | M00513 | P002496971635 | 2021-01-08 | 1.544655 | 9.980000 | 1.000 | 1 |
| 13 | M00513 | P002496971635 | 2021-01-15 | 0.528029 | 9.480000 | 1.000 | 1 |
| 14 | M00207 | P005701001406 | 2022-04-29 | 189.160111 | 21.555702 | 190.076 | 4 |
| 21 | M00517 | P002496971635 | 2021-06-18 | 0.436982 | 14.780000 | 1.000 | 6 |
| 22 | M01120 | P012589031327 | 2021-09-24 | 191.162480 | 52.670535 | 355.566 | 9 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 414789 | M00207 | P008541354896 | 2022-10-21 | 709.208178 | 11.323532 | 395.913 | 10 |
| 414791 | M00077 | P003281093665 | 2022-01-14 | 390.814985 | 11.597240 | 47.000 | 1 |
| 414795 | M00077 | P000497531023 | 2022-06-24 | 0.310983 | 35.555556 | 1.000 | 6 |
| 414809 | M00207 | P175918231397 | 2021-05-07 | 112.728088 | 28.943466 | 212.203 | 5 |
| 414811 | M00077 | P016001323615 | 2021-10-29 | 1949.895262 | 4.932344 | 45.000 | 10 |
113147 rows × 7 columns
Let's look at the distribution of strata in the test set.
# display distribution of the test data
pd.options.display.float_format = '{:,.2f}'.format
print('Number of Rows:',df_test.shape[0])
print((df_test['MARKET_ID_BLINDED'].value_counts()) / len(df_test) * 100)
print((df_test['PRODUCT_ID_BLINDED'].value_counts()) / len(df_test) * 100)
print((df_test['MONTH'].value_counts()) / len(df_test) * 100)
Number of Rows: 113147
M00211 7.15
M00207 7.04
M00418 6.58
M01120 6.57
M00182 6.55
M00085 4.88
M01316 4.81
M00513 4.79
M00017 4.75
M00217 4.57
M00208 4.53
M00077 4.39
M01015 4.36
M00110 4.31
M00517 4.26
M01119 4.26
M00320 4.25
M00045 4.03
M01519 3.99
M00520 3.91
Name: MARKET_ID_BLINDED, dtype: float64
P000523252289 0.46
P013661748665 0.46
P003213091538 0.46
P015241182282 0.46
P003071623916 0.46
...
P017159931175 0.09
P001645125473 0.08
P049512231223 0.08
P011001708253 0.07
P039015901366 0.02
Name: PRODUCT_ID_BLINDED, Length: 321, dtype: float64
4 11.55
7 11.52
9 11.39
1 10.95
5 7.84
6 7.83
8 7.83
10 7.80
3 7.77
2 7.71
12 3.91
11 3.90
Name: MONTH, dtype: float64
The distribution of categorical variables appears to be proportionate across training and test sets.
In the step below, we create linear regression models, one per combination of market and product: 5,146 models in total, since we have 5,146 market-product combinations.
The model creation function is called in a loop and performs the following steps:
1) Subsets the data for a given product and market
2) Standardizes the independent variables [by calling the function declared above]
3) Fits a linear model on the training set
4) Validates the model on the test set
5) Collects modeling stats [R-squared, coefficients, p-values, means, standard deviations, etc.]
# suppress FutureWarning
warnings.filterwarnings('ignore', category=FutureWarning)
# create dataset with list of unique products and markets
df_unique_market_product=df6[['MARKET_ID_BLINDED','PRODUCT_ID_BLINDED']].drop_duplicates()
df_unique_market_product=df_unique_market_product.reset_index(drop=True)
lin_model_stats=pd.DataFrame() # dataframe to hold model stats
df_lin_failed_markets_products=pd.DataFrame() # dataframe to hold info for failed models
fail_count=0 # variable to hold count of failed models
for i in range(df_unique_market_product.shape[0]):
# print counts after every 500 models
if (i % 500 == 0) or (i==df_unique_market_product.shape[0]-1):
print("Completed ",str(i)+" Models..")
print("Models failed for "+ str(fail_count)+" Markets and Products..")
# call the function that builds the model
market=df_unique_market_product.loc[i,'MARKET_ID_BLINDED']
product=df_unique_market_product.loc[i,'PRODUCT_ID_BLINDED']
try:
lin_model_stats_this_iteration=func_build_price_elasticity_lin_model(this_market=market, this_product=product)
# incrementally append stats for successful models
lin_model_stats=pd.concat([lin_model_stats,lin_model_stats_this_iteration], ignore_index=True)
except:
# for every exception, increase fail counter by 1
fail_count+=1
# create dataset to hold market and product for which the model failed
failed_markets_products_this_iteration={
'MARKET_ID_BLINDED' : market,
'PRODUCT_ID_BLINDED' : product,
}
df_lin_failed_markets_products_this_iteration = pd.DataFrame(failed_markets_products_this_iteration, index=[0])
# incrementally append info for failed models
df_lin_failed_markets_products=pd.concat([df_lin_failed_markets_products,
df_lin_failed_markets_products_this_iteration], ignore_index=True)
Completed 0 Models..
Models failed for 0 Markets and Products..
Completed 500 Models..
Models failed for 7 Markets and Products..
Completed 1000 Models..
Models failed for 11 Markets and Products..
Completed 1500 Models..
Models failed for 21 Markets and Products..
Completed 2000 Models..
Models failed for 36 Markets and Products..
Completed 2500 Models..
Models failed for 83 Markets and Products..
Completed 3000 Models..
Models failed for 139 Markets and Products..
Completed 3500 Models..
Models failed for 209 Markets and Products..
Completed 4000 Models..
Models failed for 255 Markets and Products..
Completed 4500 Models..
Models failed for 310 Markets and Products..
Completed 5000 Models..
Models failed for 379 Markets and Products..
Completed 5146 Models..
Models failed for 413 Markets and Products..
Here are the output stats, sorted in descending order of test-set R-squared:
lin_model_stats.sort_values('R2_TEST', ascending=False)
| | MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | COEFF_INTERCEPT_STD | COEFF_UNIT_PRICE_STD | COEFF_NUMBER_OF_STORES_SELLING_STD | P_VALUE_INTERCEPT | P_VALUE_UNIT_PRICE_STD | P_VALUE_NUMBER_OF_STORES_SELLING_STD | R2_TRAIN | R2_TEST | STD_UNIT_PRICE_ORIGINAL | STD_NUMBER_OF_STORES_SELLING_ORIGINAL | STD_ADJUSTED_PHYSICAL_VOL | MEAN_UNIT_PRICE_ORIGINAL | MEAN_NUMBER_OF_STORES_SELLING_ORIGINAL | MEAN_ADJUSTED_PHYSICAL_VOL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4265 | M00517 | P000396742388 | 6.80 | 0.02 | 7.18 | 0.00 | 0.84 | 0.00 | 0.99 | 0.99 | 2.02 | 36.37 | 7.22 | 31.96 | 32.98 | 6.80 |
| 4485 | M00208 | P001681734304 | 24.73 | -0.29 | 12.59 | 0.00 | 0.36 | 0.00 | 0.98 | 0.99 | 1.72 | 56.46 | 12.74 | 31.87 | 110.13 | 24.73 |
| 2212 | M00017 | P001446160983 | 5.86 | -0.14 | 9.08 | 0.00 | 0.51 | 0.00 | 0.99 | 0.99 | 3.75 | 41.00 | 9.01 | 20.69 | 26.62 | 5.86 |
| 2620 | M00077 | P003224111541 | 28.40 | -0.25 | 22.26 | 0.00 | 0.85 | 0.00 | 0.93 | 0.99 | 0.60 | 4.60 | 23.13 | 9.84 | 7.89 | 28.40 |
| 2096 | M00110 | P001446160983 | 1.38 | -0.12 | 2.97 | 0.00 | 0.04 | 0.00 | 0.98 | 0.99 | 2.71 | 24.55 | 3.04 | 21.22 | 8.03 | 1.38 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4580 | M00513 | P000063180125 | 0.44 | -0.17 | 0.23 | 0.00 | 0.10 | 0.04 | 0.79 | -12.19 | 0.72 | 0.35 | 0.34 | 10.27 | 1.14 | 0.44 |
| 2013 | M00207 | P001512602952 | 0.66 | -2.00 | -2.04 | 0.00 | 0.23 | 0.22 | 0.08 | -24.86 | 0.57 | 0.20 | 0.33 | 26.17 | 1.04 | 0.66 |
| 2253 | M01316 | P000908305120 | 0.72 | -0.35 | 0.00 | 0.00 | 0.00 | NaN | 0.76 | -161.49 | 2.65 | 0.00 | 0.40 | 15.11 | 1.00 | 0.72 |
| 2594 | M01015 | P095311041806 | 1.24 | -0.51 | 3.48 | 0.00 | 0.12 | 0.00 | 0.78 | -252.91 | 3.32 | 22.07 | 4.17 | 32.94 | 5.92 | 1.24 |
| 2997 | M00110 | P159713991336 | 19.50 | 0.45 | 67.53 | 0.00 | 0.40 | 0.00 | 1.00 | -820.26 | 1.37 | 25.75 | 67.32 | 9.63 | 8.18 | 19.50 |
4733 rows × 16 columns
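Because the predictors were standardized but the target was not, the stored means and standard deviations let us unstandardize the price coefficient and express it as a point elasticity at the average price and volume. This is an interpretation sketch of the table's columns, not code from the notebook:

```python
def point_elasticity(coeff_unit_price_std, std_unit_price, mean_unit_price, mean_volume):
    """Point elasticity at the means, from a regression with standardized predictors.

    With x_std = (P - mean_P) / std_P and volume Q left in original units,
    dQ/dP = coeff_std / std_P, so elasticity = (dQ/dP) * mean_P / mean_Q.
    """
    slope = coeff_unit_price_std / std_unit_price  # volume change per $1 price change
    return slope * mean_unit_price / mean_volume

# values taken from the M00208 / P001681734304 row of the stats table above
print(round(point_elasticity(-0.29, 1.72, 31.87, 24.73), 3))  # -0.217: inelastic at the mean
```

Note that for this particular row the price coefficient's p-value is 0.36, so the implied elasticity is not statistically significant; the conversion is shown purely to illustrate the arithmetic.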
Model generation failed for the following 414 combinations of markets and products:
df_lin_failed_markets_products
| | MARKET_ID_BLINDED | PRODUCT_ID_BLINDED |
|---|---|---|
| 0 | M00217 | P018021181526 |
| 1 | M00077 | P004983481087 |
| 2 | M00513 | P018021181526 |
| 3 | M01015 | P002496971635 |
| 4 | M00077 | P001753472153 |
| ... | ... | ... |
| 409 | M00077 | P005416551691 |
| 410 | M00045 | P039714651260 |
| 411 | M00077 | P000023761440 |
| 412 | M00077 | P000517388311 |
| 413 | M00077 | P005632991804 |
414 rows × 2 columns
Saving the model results to disk for later use:
lin_model_stats.to_pickle('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Datasets/lin_model_stats.pkl')
df_lin_failed_markets_products.to_pickle('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Datasets/df_lin_failed_markets_products.pkl')
#model_stats= pd.read_pickle('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Datasets/model_stats.pkl')
#df_failed_markets_products= pd.read_pickle('/Users/atulsehgal/Documents/UofU/MSBA Study/IS 6813-001 Spring 2023 MSBA CapstoneCompletion/Datasets/df_failed_markets_products.pkl')
Just like linear regression, in the step below we create random forest regression models, one per combination of market and product: 5,146 models in total, since we have 5,146 market-product combinations.
The model creation function is called in a loop and performs the following steps:
1) Subsets the data for a given product and market
2) Standardizes the independent variables [by calling the function declared above]
3) Fits a random forest regression model on the training set
4) Validates the model on the test set
5) Collects modeling stats [R-squared]
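As with the linear case, a minimal sketch of such a per-segment random forest fit (illustrative only, using scikit-learn's RandomForestRegressor on toy data; the hyperparameters are assumptions, not the notebook's settings):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fit_rf_model_sketch(train, test):
    """Fit a random forest per segment; return train/test R^2 (the notebook keeps only these)."""
    X_cols = ['UNIT_PRICE', 'NUMBER_OF_STORES_SELLING']
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(train[X_cols], train['ADJUSTED_STD_PHYSICAL_VOL'])
    return (rf.score(train[X_cols], train['ADJUSTED_STD_PHYSICAL_VOL']),
            rf.score(test[X_cols], test['ADJUSTED_STD_PHYSICAL_VOL']))

# toy segment with a non-linear (reciprocal) price response
rng = np.random.default_rng(7)
seg = pd.DataFrame({'UNIT_PRICE': rng.uniform(5, 15, 200),
                    'NUMBER_OF_STORES_SELLING': rng.uniform(10, 50, 200)})
seg['ADJUSTED_STD_PHYSICAL_VOL'] = (500 / seg['UNIT_PRICE']
                                    + 2 * seg['NUMBER_OF_STORES_SELLING']
                                    + rng.normal(0, 3, 200))
r2_train, r2_test = fit_rf_model_sketch(seg.iloc[:140], seg.iloc[140:])
print(f"train R2 = {r2_train:.2f}, test R2 = {r2_test:.2f}")
```

One design note: tree splits are invariant to feature scaling, so the standardization step matters for the linear models but is not strictly needed for the random forests.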
# suppress FutureWarning
warnings.filterwarnings('ignore', category=FutureWarning)
# create dataset with list of unique products and markets
df_unique_market_product=df6[['MARKET_ID_BLINDED','PRODUCT_ID_BLINDED']].drop_duplicates()
df_unique_market_product=df_unique_market_product.reset_index(drop=True)
rf_model_stats=pd.DataFrame() # dataframe to hold model stats
df_rf_failed_markets_products=pd.DataFrame() # dataframe to hold info for failed models
fail_count=0 # variable to hold count of failed models
for i in range(df_unique_market_product.shape[0]):
# print counts after every 500 models
if (i % 500 == 0) or (i==df_unique_market_product.shape[0]-1):
print("Completed ",str(i)+" Models..")
print("Models failed for "+ str(fail_count)+" Markets and Products..")
# call the function that builds the model
market=df_unique_market_product.loc[i,'MARKET_ID_BLINDED']
product=df_unique_market_product.loc[i,'PRODUCT_ID_BLINDED']
try:
rf_model_stats_this_iteration=func_build_price_elasticity_rf_model(this_market=market, this_product=product)
# incrementally append stats for successful models
rf_model_stats=pd.concat([rf_model_stats,rf_model_stats_this_iteration], ignore_index=True)
except:
# for every exception, increase fail counter by 1
fail_count+=1
# create dataset to hold market and product for which the model failed
failed_markets_products_this_iteration={
'MARKET_ID_BLINDED' : market,
'PRODUCT_ID_BLINDED' : product,
}
df_rf_failed_markets_products_this_iteration = pd.DataFrame(failed_markets_products_this_iteration, index=[0])
# incrementally append info for failed models
df_rf_failed_markets_products=pd.concat([df_rf_failed_markets_products,
df_rf_failed_markets_products_this_iteration], ignore_index=True)
Completed 0 Models..
Models failed for 0 Markets and Products..
Completed 500 Models..
Models failed for 7 Markets and Products..
Completed 1000 Models..
Models failed for 11 Markets and Products..
Completed 1500 Models..
Models failed for 21 Markets and Products..
Completed 2000 Models..
Models failed for 36 Markets and Products..
Completed 2500 Models..
Models failed for 83 Markets and Products..
Completed 3000 Models..
Models failed for 138 Markets and Products..
Completed 3500 Models..
Models failed for 208 Markets and Products..
Completed 4000 Models..
Models failed for 254 Markets and Products..
Completed 4500 Models..
Models failed for 309 Markets and Products..
Completed 5000 Models..
Models failed for 378 Markets and Products..
Completed 5146 Models..
Models failed for 412 Markets and Products..
Here are the output stats, sorted in descending order of test-set R-squared:
rf_model_stats.sort_values('R2_TEST', ascending=False)
| | MARKET_ID_BLINDED | PRODUCT_ID_BLINDED | R2_TRAIN | R2_TEST |
|---|---|---|---|---|
| 4486 | M00208 | P001681734304 | 1.00 | 1.00 |
| 2212 | M00017 | P001446160983 | 1.00 | 0.99 |
| 4266 | M00517 | P000396742388 | 1.00 | 0.99 |
| 3925 | M00320 | P007911351239 | 0.99 | 0.99 |
| 4068 | M01119 | P001821311984 | 0.98 | 0.99 |
| ... | ... | ... | ... | ... |
| 2400 | M01316 | P010231448381 | 0.86 | -10.30 |
| 4284 | M00517 | P005771283617 | 0.89 | -17.03 |
| 4475 | M00520 | P001209217779 | 0.80 | -23.91 |
| 2594 | M01015 | P095311041806 | 0.90 | -104.14 |
| 2998 | M00110 | P159713991336 | 0.86 | -168.99 |
4734 rows × 4 columns
Model generation failed for the following 413 combinations of markets and products:
df_rf_failed_markets_products
| | MARKET_ID_BLINDED | PRODUCT_ID_BLINDED |
|---|---|---|
| 0 | M00217 | P018021181526 |
| 1 | M00077 | P004983481087 |
| 2 | M00513 | P018021181526 |
| 3 | M01015 | P002496971635 |
| 4 | M00077 | P001753472153 |
| ... | ... | ... |
| 408 | M00077 | P005416551691 |
| 409 | M00045 | P039714651260 |
| 410 | M00077 | P000023761440 |
| 411 | M00077 | P000517388311 |
| 412 | M00077 | P005632991804 |
413 rows × 2 columns
Let's compare the training-set R-squared of the Linear Regression and Random Forest models.
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
# NOTE: sns.distplot is deprecated in recent seaborn versions; sns.histplot(..., kde=True) is the modern equivalent
p=sns.distplot(lin_model_stats['R2_TRAIN'], color='red', label='Linear Regression')
sns.distplot(rf_model_stats['R2_TRAIN'], color='blue', label='Random Forest Regression')
p.set_xlabel("R-sq", fontsize = 20)
p.set_ylabel("Density", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.suptitle('R-square : Training Set', fontsize = 26)
plt.title('Comparison between Linear and Random Forest Regression', fontsize = 14)
plt.legend(fontsize=14,loc='upper left')
<matplotlib.legend.Legend at 0x7faf17ec8e20>
The Random Forest Regression models clearly have a better training-set R-squared than the Linear Regression models.
Because random forest regression can capture non-linear relationships between variables, it often provides a better fit to the training data than linear regression. This results in a higher R-squared on the training set, indicating a closer overall fit of the model to the data.
However, a high R-squared on the training set does not necessarily mean the model will perform well on new, unseen data. Overfitting occurs when a model fits the training data too closely and fails to generalize. It is important to evaluate the model's performance on a separate validation or test set to ensure that it is not overfitting.
Let's compare the test set R-square of Linear Regression and Random forest models ..
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.distplot(lin_model_stats['R2_TEST'], color='red', label='Linear Regression')
sns.distplot(rf_model_stats['R2_TEST'], color='blue', label='Random Forest Regression')
p.set_xlabel("R-sq", fontsize = 20)
p.set_ylabel("Density", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.suptitle('R-square : Test Set', fontsize = 26)
plt.title('Comparison between Linear and Random Forest Regression', fontsize = 14)
plt.legend(fontsize=14,loc='upper left')
<matplotlib.legend.Legend at 0x7faf18ad9a90>
The plot is not very clear; we need to zoom in.
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.distplot(lin_model_stats['R2_TEST'], color='red', label='Linear Regression')
sns.distplot(rf_model_stats['R2_TEST'], color='blue', label='Random Forest Regression')
p.set_xlabel("R-sq", fontsize = 20)
p.set_ylabel("Density", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.xlim(-1,1)
plt.suptitle('R-square : Test Set', fontsize = 26)
plt.title('Comparison between Linear and Random Forest Regression', fontsize = 14)
plt.legend(fontsize=14,loc='upper left')
<matplotlib.legend.Legend at 0x7faf17c95b80>
The test-set R-squared values for both models don't look very promising.
In many cases, we see a negative R-squared for both the RF and linear models.
A negative R-squared on a test set means the regression model performs worse than a horizontal line at the mean of the dependent variable, i.e., worse than always predicting the mean. Such a model is not a good fit for the data and is likely making poor predictions.
However, it is also worth noting that a negative R-squared value on a test set can sometimes occur due to chance, particularly if the sample size is small or if the data is particularly noisy or complex.
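Concretely, R-squared is 1 - SS_res/SS_tot, so it turns negative exactly when the model's squared errors exceed those of the mean-only baseline. A tiny numeric example:

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, relative to the mean-only baseline."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

y_true = [10.0, 12.0, 11.0, 13.0]          # mean = 11.5

print(r2(y_true, [11.5] * 4))              # 0.0: predicting the mean scores exactly zero
print(r2(y_true, [20.0, 2.0, 25.0, 1.0]))  # -107.0: far worse than predicting the mean
```

Unlike in-sample R-squared, the out-of-sample version has no lower bound, which is why the worst segments above reach values like -820.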
Putting R-square of both training and test sets for both RF and linear models together in one view..
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.scatterplot(data=lin_model_stats, x="R2_TRAIN", y="R2_TEST", alpha=0.1, color='red', label='Linear Regression')
sns.scatterplot(data=rf_model_stats, x="R2_TRAIN", y="R2_TEST", alpha=0.1, color='blue', label='Random Forest Regression')
plt.xlabel("R-sq: Training", fontsize = 20)
plt.ylabel("R-sq: Test", fontsize = 20)
plt.xticks(fontsize= 24)
plt.yticks(fontsize= 24)
plt.ylim(-1,1)
plt.suptitle('R-square : Training and Test Set', fontsize = 26)
plt.title('Comparison between Linear and Random Forest Regression', fontsize = 14)
plt.legend(fontsize=14,loc='upper left')
<matplotlib.legend.Legend at 0x7faf1c03af70>
Overall, it looks like the RF models are better than the linear models..
However, neither type of model is very promising.
The not-so-promising results are likely due to the following factors:
1) We are fitting the models at a very detailed grain [per market per product]
2) There are other factors in play that we have not considered.
The p-value is important in determining the significance of coefficients because it provides a statistical measure of how likely it is that the observed relationship between the predictor and outcome variables is due to chance or random variation. It helps us to assess whether the coefficient is a true reflection of the underlying relationship between the variables, or whether it is simply a result of noise or sampling variability.
Let's examine the p-value of Unit Price from linear regression models..
p_value_significant=round(sum(lin_model_stats['P_VALUE_UNIT_PRICE_STD']<=0.05)/lin_model_stats.shape[0]*100,1)
p_value_insignificant=100-p_value_significant
print("Models with statistically significant Unit Price= "+str(p_value_significant)+"%")
print("Models with statistically insignificant Unit Price= "+str(p_value_insignificant)+"%")
Models with statistically significant Unit Price= 57.2%
Models with statistically insignificant Unit Price= 42.8%
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.distplot(lin_model_stats['P_VALUE_UNIT_PRICE_STD'], color='red', label='Linear Regression')
p.set_xlabel("p-value", fontsize = 20)
p.set_ylabel("Density", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.suptitle('p-value : Unit Price', fontsize = 26)
plt.title("Models with statistically significant Unit Price= "+str(p_value_significant)+"%", fontsize = 20)
plt.legend(fontsize=14,loc='upper right')
plt.axvline(x=0.05,
color="black",
linestyle="dashed")
plt.text(0.05, 14, 'P Value=0.05', horizontalalignment='right',verticalalignment='center',size=20, color='black', weight='semibold',rotation=90)
Text(0.05, 14, 'P Value=0.05')
In 57.2% of cases [markets and products], price appears to have a statistically significant relationship with sales volume..
Price elasticity is a measure of the responsiveness of the demand for a product or service to a change in its price. It is a ratio of the percentage change in quantity demanded to the percentage change in price.
More formally, the price elasticity of demand is defined as:
Price elasticity of demand = (% change in quantity demanded) / (% change in price)
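A quick worked example of the formula with hypothetical numbers: if price rises from $2.00 to $2.50 (+25%) and weekly volume falls from 500 to 400 units (-20%), the elasticity is -0.8, meaning demand falls about 0.8% for every 1% price increase.

```python
# Hypothetical numbers: price $2.00 -> $2.50, volume 500 -> 400 units.
pct_change_qty = (400 - 500) / 500       # -0.20  (-20%)
pct_change_price = (2.50 - 2.00) / 2.00  #  0.25  (+25%)

price_elasticity = pct_change_qty / pct_change_price
print(price_elasticity)  # -0.8
```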
Let's see what our models have to say about 'Price Elasticity'
In linear regression, the coefficients provide both the magnitude and direction of the relationship between the predictor variable and the outcome variable. The sign of the coefficient (positive or negative) indicates the direction of the relationship.
If the coefficient is positive, it indicates that there is a positive relationship between the predictor variable and the outcome variable. This means that as the predictor variable increases, the outcome variable also increases. Hence, a positive coefficient for 'unit price' indicates that as 'unit price' increases, sales also tend to increase.
If the coefficient is negative, it indicates that there is a negative relationship between the predictor variable and the outcome variable. This means that as the predictor variable increases, the outcome variable decreases. Hence, a negative coefficient for 'unit price' indicates that as 'unit price' increases, sales tend to decrease.
coeff_negative=round(sum(lin_model_stats['COEFF_UNIT_PRICE_STD']<=0)/lin_model_stats.shape[0]*100,1)
coeff_positive=100-coeff_negative
print("Models with negative relationship between Price and Volume= "+str(coeff_negative)+"%")
print("Models with positive relationship between Price and Volume= "+str(coeff_positive)+"%")
Models with negative relationship between Price and Volume= 81.0%
Models with positive relationship between Price and Volume= 19.0%
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.distplot(lin_model_stats['COEFF_UNIT_PRICE_STD'], color='red', label='Linear Regression',
kde=True, hist=False, kde_kws={'fill': True})
p.set_xlabel("Coefficient for Unit Price", fontsize = 20)
p.set_ylabel("Density", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.suptitle('Coefficient Value : Unit Price [Standardized]', fontsize = 26)
plt.title("Models with negative relationship between Price and Volume= "+str(coeff_negative)+"%", fontsize = 20)
plt.legend(fontsize=14,loc='upper right')
plt.axvline(x=0,
color="black",
linestyle="dashed")
<matplotlib.lines.Line2D at 0x7faf16070220>
Zooming in..
plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
p=sns.distplot(lin_model_stats['COEFF_UNIT_PRICE_STD'], color='red', label='Linear Regression',
kde=True, hist=False, kde_kws={'fill': True})
p.set_xlabel("Coefficient for Unit Price", fontsize = 20)
p.set_ylabel("Density", fontsize = 20)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.xlim(-3000,1000)
plt.suptitle('Coefficient Value : Unit Price [Standardized]', fontsize = 26)
plt.title("Models with negative relationship between Price and Volume= "+str(coeff_negative)+"%", fontsize = 20)
plt.legend(fontsize=14,loc='upper right')
plt.axvline(x=0,
color="black",
linestyle="dashed")
<matplotlib.lines.Line2D at 0x7faf161f2e80>
Since in 81% of cases we have a negative relationship between 'Unit Price' and 'Sales Volume', we can conclude that..
In 81% of cases, as 'unit price' increases, sales tend to decrease.
While random forest regression is a powerful and flexible modeling approach, it can be difficult to interpret the results because of the complex interactions between the predictor variables. PDPs help to address this problem by providing a graphical representation of the relationship between a specific predictor variable and the outcome variable, while holding all other predictor variables constant.
Let's generate 50 random examples of the partial dependence plot for 'Unit Price'..
# create dataset with list of unique products and markets
df_unique_market_product=df6[['MARKET_ID_BLINDED','PRODUCT_ID_BLINDED']].drop_duplicates()
df_unique_market_product=df_unique_market_product.reset_index(drop=True)
for i in [random.randint(0, df_unique_market_product.shape[0] - 1) for _ in range(50)]:
    market = df_unique_market_product.loc[i, 'MARKET_ID_BLINDED']
    product = df_unique_market_product.loc[i, 'PRODUCT_ID_BLINDED']
    try:
        func_plot_partial_dependence_for_unit_price(this_market=market,
                                                    this_product=product)
    except Exception:
        print("Plot generation failed for Market =", market, "and Product =", product)
Plot generation failed for Market = M00513 and Product = P007851279737
Plot generation failed for Market = M00045 and Product = P012461050498
Plot generation failed for Market = M00520 and Product = P111716121631
In the 50 random plots above we can see that, mostly, as 'unit price' increases, sales tend to decrease; however, there are a few exceptions to this rule.
Putting it all together, we do not have very promising models for 'Price Elasticity' at the grain of Market and Product, but we do have some good directional insights.
In 57.2% of cases [markets and products], price appears to have a statistically significant relationship with sales volume..
In 81% of cases, as 'unit price' increases, sales tend to decrease.
These models can be operationalized via a Price Optimization Engine..
Using this Price Optimization Engine, we can determine the optimal price for each product in each market. The goal of a price optimization engine can be to maximize revenue or profit while taking 'Price Elasticity' into account.
If we were to develop a quick prototype of this Optimization Engine in Excel, linear regression would be a good choice, as we can plug the coefficients into Excel and use a simple greedy algorithm approach [even though random forest regression appears to perform better than linear regression].
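As a sketch of that greedy idea (with made-up coefficients, not the fitted ones): for a linear demand model volume = intercept + coeff x price, we can scan a price grid and pick the revenue-maximizing point, which for linear demand matches the closed-form optimum -intercept / (2 x coeff).

```python
import numpy as np

# Hypothetical linear demand fit for one market/product:
# volume = intercept + coeff * price   (coeff < 0: demand falls as price rises)
intercept, coeff = 1000.0, -200.0

prices = np.arange(0.50, 5.01, 0.01)           # candidate price grid
volumes = intercept + coeff * prices
revenues = prices * np.clip(volumes, 0, None)  # revenue = price x volume

best = prices[np.argmax(revenues)]
print(round(best, 2))  # 2.5, matching -intercept / (2 * coeff)
```

The same scan could be done in an Excel column of candidate prices, which is why a linear model with explicit coefficients is convenient for a quick prototype.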